
Programming Model

Posted on: 2023-01-16

Conclusions

Taxonomy

Roughly ordered from least to most parallel/complex.

In this ordering we move from "pointers are common" to "no pointers". In GPU SIMT, binding buffers and other resources provides something "pointer like". "Bindless" also provides a limited pointer-like behavior with resources.

Heterogeneous - special-purpose engines. GPU SIMT has access to such engines through the language, such as the texture filtering hardware.

Summary of Thoughts

Discussion

Looking at ISPC, languages with SIMD support such as Odin and Zig, and thinking about GPU and typical CPU programming models raises a bunch of issues.

ISPC's programming model is a stricter form of SIMT. Programming is mostly done in a scalar style, and then some number of those operations happen across larger vector registers. How many depends on command-line compile options and on the hardware available. It is stricter than a GPU, because there is a single PC, whereas modern GPUs have a PC per lane. ISPC also allows a more explicit form of mixing uniform (scalar-like) and varying code. I don't know if there is anything analogous on GPU.

Many of these languages allow writing SIMD-like code even if that makes no difference to performance. On most modern GPUs, writing code in HLSL appears as if vector and matrix types are hardware native, yet in practice it is common for this to be boiled down to scalar code. Originally this might have been because some GPUs did have SIMD instructions. Having SIMD-like operations not only allows for potentially faster code when that is the target; it is also often quite convenient for many applications. Taking SIMD-like code and making it scalar is trivial. Taking scalar code and making it efficient SIMD is hard.

The GPU programming model has some notionally nice properties. Most of the time code can be written as if it is scalar/2-4 wide SIMD. If a GPU comes along with more lanes, then the code has the potential to run more efficiently. How true this is in practice is perhaps somewhat debatable. AMD typically has 64 lanes and nVidia 32. Any code that communicates across lanes is likely to have to be written for a specific width. More recently, extensions have been made to allow 32-lane use on AMD hardware. In HLSL, having 64 lanes is somewhat problematic because of the apparent desire (need?) to store a mask in a uint2. Writing code that could work with either is of course possible, but it is going to step outside of typical language constructs.

The fixed GPU warp size seems like a kind of specialized SIMD. It seems that with double you still get the same number of lanes as with float. If you use float16, you also end up with the same number of lanes, but with an instruction that can issue 2 float16 ops. On a typical CPU SIMD, the register size is fixed and so the number of lanes depends on the contained type. As alluded to earlier, RISC-V is slightly different in this capacity, as the element size can be specified and the instruction multiply-pumped to produce the suitable result. Something similar seems possible with the Scalable Vector Extensions (SVE) on ARM.

On machines that have hardware mask registers, operations can be performed while ignoring lanes that aren't enabled. As long as the mask width is big enough to hold the maximum number of lanes (or some multiple of it), things are somewhat simplified and faster to execute.

Interestingly, AVX-512 mask registers appear to "normally" be 16 bits. That seems a little on the small side, which is probably why the AVX-512BW extension adds mask instructions that can operate on up to 64 bits.

Architectures

We want a programming model that can work for many different kinds of computing architectures.

In terms of implementing parallelism/concurrency building blocks there are many options. Not all of them can be applied to every architecture. A CPU can typically support them all; a GPU might only be able to implement a subset.

GPUs may support some SIMD-like operations. On nVidia GPUs, half support is made available via the half2 type, which packs two 16-bit values into a single lane. Some vendors' GPUs previously supported SIMD but now operate on scalar values. Some GPUs (seemingly Intel's) are predicated SIMD systems.

As a point of policy, we would prefer to be able to define different programming styles where appropriate within the language. We will need to expose the appropriate building blocks to be able to build that support.

On the one hand we want to have powerful abstractions, such that programmers can develop efficient code without worrying too much about the details. Abstractions are not the end of the story, though - being a systems-level language, we need to be able to expose the mechanisms of the hardware. Additionally, we need to have controls available such that a programmer can inform the compiler to honor how code was written and not optimize it in certain ways. In essence, there need to be escape hatches where a programmer can say "I know what I'm doing", so within some constraints, "do as I say".

Writing code against the abstractions should, in most situations, produce code that is at least "sensible" for the target, if not always optimal.