
Programming Model

Posted on: 2023-01-16

Conclusions

Taxonomy

Roughly ordered from least to most parallel/complex.

In this ordering we move from "pointers are common" to "no pointers". In GPU SIMT, binding buffers and other resources provides something "pointer like". "Bindless" also provides a limited pointer-like behavior with resources.

Heterogeneous - special-purpose engines. GPU SIMT has access to such engines through the language, such as the texture filtering hardware.

Summary of Thoughts

Discussion

Looking at ISPC, languages with SIMD support such as Odin and Zig, and thinking about GPU and typical CPU programming models raises a bunch of issues.

ISPC's programming model is a stricter form of SIMT. Programming is mostly done in a scalar style, and then some number of those operations happen across larger vector registers. How many depends on command-line compile options and on the hardware available. It is stricter than a GPU, because there is a single PC, whereas modern GPUs have a PC per lane. ISPC also allows a more explicit form of mixing uniform (scalar-like) and varying code. I don't know if there is anything analogous on GPU.

Many of these languages allow writing SIMD-like code even if that makes no difference to performance. On most modern GPUs, writing code in HLSL appears as if vector and matrix types are hardware native, yet in practice it is common for this to be boiled down to scalar code. Originally this might have been because some GPUs did have SIMD instructions. Having SIMD-like operations not only allows for potentially faster code when that is the target; it is also often quite convenient for many applications. Taking SIMD-like code and making it scalar is trivial. Taking scalar code and making it efficient SIMD is hard.

The GPU programming model has some notionally nice properties. Most of the time code can be written as if it is scalar/2-4 wide SIMD. If a GPU comes along with more lanes, then the code has the potential to run more efficiently. How true this is in practice is perhaps somewhat debatable. AMD typically has 64 lanes and nVidia 32. Any code that communicates across lanes is likely to have to be written for a specific width. More recently, extensions have been made to allow 32-lane use on AMD hardware. In HLSL, having 64 lanes is somewhat problematic because of the apparent desire (need?) to store a mask in a uint2. Writing code that could work with either is of course possible, but it is going to step outside of typical language constructs.

The fixed GPU warp size seems like a kind of specialized SIMD. It seems that with double you still get the same number of lanes as with float. If you use float16, you also end up with the same number of lanes, but with an instruction that can issue 2 float16 ops. On a typical CPU SIMD, the register size is fixed and so the number of lanes depends on the contained type. As alluded to earlier, RISC-V is slightly different in this capacity, as the element size can be specified and the instruction multiply-pumped to produce the suitable result. Something similar seems possible with the Scalable Vector Extensions (SVE) on ARM.

On machines that have hardware mask registers, operations can be performed while ignoring lanes that aren't enabled. As long as the mask width is big enough to hold the maximum number of lanes (or some multiple of it), things are somewhat simplified and faster to execute.

Interestingly, AVX-512 mask registers appear to "normally" be 16 bits. That seems a little on the small side, which is probably why the AVX-512BW extension adds mask instructions that can operate on up to 64 bits.

Architectures

We want a programming model that can work for many different kinds of computing architectures.

In terms of implementing parallelism/concurrency building blocks there are many options. Not all of them can be applied to every architecture. A CPU can typically support them all; a GPU might only be able to implement a subset.

GPUs may support some SIMD-like operations. On nVidia GPUs, half support is made available via the half2 type, which packs two 16-bit values into a single lane. Some vendors' GPUs previously supported SIMD but now operate on scalar values. Some GPUs (seemingly Intel's) are predicated SIMD systems.

As a point of policy, we would prefer to be able to define different programming styles where appropriate within the language. We will need to expose the appropriate building blocks to be able to build that support.

On the one hand we want to have powerful abstractions, such that programmers can develop efficient code without worrying too much about the details. Abstractions are not the end of the story, though - being a systems-level language, we need to be able to expose the mechanisms of the hardware. Additionally, we need to have controls available such that a programmer can inform the compiler to honor how code was written and not optimize it in certain ways. In essence, there need to be escape hatches where a programmer can say "I know what I'm doing", so within some constraints, "do as I say".

Writing code against the abstractions should, in most situations, produce code that is at least "sensible" for the target, if not always optimal.