
ISPC Programming Language

Posted on: 2023-01-011

Conclusion

ISPC is interesting. It's perhaps worth saying that it doesn't fit in the general-purpose language space: you wouldn't develop your whole application in it, you would accelerate parts of it. It has a lot in common with CUDA; where CUDA is a GPU C++, ISPC is a SIMD C.

The programming model is GPU-like, but isn't actually the same as modern GPU implementations; ISPC's is more constrained and more precisely defined. The language allows an implicit intermixing of SIMD and regular code. For example, if a comparison is performed on uniform variables and the result is assigned to a uniform, the result is regular non-SIMD code.
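
For example, something like this (my own sketch, with made-up names): the uniform comparison compiles to an ordinary scalar branch, while the varying comparison inside the foreach turns into masked SIMD code.

void clamp_values(uniform float vals[], uniform int count, uniform float limit)
{
    uniform bool tiny = (count < 16);    // uniform compared with uniform
    if (tiny) {
        limit *= 0.5f;                   // plain scalar, non-SIMD code path
    }
    foreach (i = 0 ... count) {
        float v = vals[i];               // varying: one value per program instance
        if (v > limit) {                 // varying comparison -> execution mask
            vals[i] = limit;
        }
    }
}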

The GPU-like programming model provides an abstraction such that most of your code can be written conventionally.

The language tries to abstract over the size of a gang. I wonder how this really works in terms of generated code - because a fixed gang size perhaps doesn't abstract that well over different SIMD instruction sets.

As a programming model there is a lot to like. I think I would want to be able to write SIMD-explicit code in some simplified manner. The GPU-like programming model abstraction has a lot going for it, but I think there are, at times, significant advantages to being more explicit.

The use of the uniform and varying keywords in the type system has sensible behaviours. I do find myself trying to map the concepts onto what's going on underneath, and finding it somewhat confusing. Perhaps using the language on real tasks would lessen this.

Beyond the programming model, the other significant difference is that GPU languages like CUDA expose GPU buffers, like textures. Textures may not make sense here, due to the complexity of implementation and the lack of dedicated hardware. Other kinds of buffers, though, would provide a mechanism to communicate with regular CPU code that would allow for AOS or other transformations.

There is a lot to learn from ISPC, not least that it is a mature, real-world language targeting SIMD.

Interesting

Good

Not sure

Bad

Ugly

Discussion

Perhaps worth taking on board the description of how the PC and execution work, as it somewhat formally explains the model from a programmer's point of view.

The definition of maximally converged is useful and an interesting contrast to CUDA/OpenCL. It helps in building a mental model of how execution works. That said, it is something that aligns closer to how CPU SIMD works than to how modern GPU SIMT works (with each thread having its own PC).

It is an error to try to assign a varying value to a uniform variable, though uniform values can be assigned to varying variables. Assignments to uniform variables are not affected by the execution mask (there's no unambiguous way that they could be); rather, they always apply if the program counter passes through a statement that is a uniform assignment.

Hmm. Presumably you could write a varying value to a uniform. If the mask was empty, nothing would happen. If one thread was active, its value would be written. If more than one was active, one of them would win. This would work for values which can be written atomically, but not for, say, a struct.

Perhaps the idea is to have assignment work this way as a language feature, so that such cases are identified, not because there is conceptually no way to make this do something "sensible".

I guess this is what the second part means: an earlier part says you can't assign something varying to a uniform variable. Ok. I suppose if it is guaranteed that the PC can only pass through an instruction if one or more threads is active (as is touched on earlier in the doc), then if the PC hits a store to a uniform, it must write the value.
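
To make my reading of the rules concrete (this is my own illustration, not from the docs):

void assignment_rules(uniform float u) {
    float v = programIndex;   // varying

    v = u;                    // fine: a uniform value assigned to a varying variable
    // u = v;                 // error: a varying value assigned to a uniform variable

    if (programIndex > 2) {   // varying condition
        u = 1;                // uniform assignment: not masked; it applies whenever
                              // the PC reaches it, i.e. if any thread took this branch
    }
}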

In the model implemented by ispc, any side effect from one program instance is visible to other program instances in the gang after the next sequence point in the program. This is a significant difference between ispc and SPMD languages like OpenCL* and CUDA*, which require barrier synchronization among the running program instances with functions like barrier() or __syncthreads(), respectively, to ensure this condition.

This also comes from the different execution models. It's nice that this is well defined, and in some respects simpler. On the other hand, what does it mean if these ideas are applied to other targets? Is it possible to determine when to add the sync points? If not, it implies adding such sync points and ignoring them when these guarantees are available.
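
As a sketch of that guarantee (the names are mine): each program instance writes one slot, and after the next sequence point every instance can read a neighbour's slot, with no barrier()/__syncthreads() equivalent needed.

void neighbour_exchange(uniform float out[]) {
    uniform float scratch[programCount];
    scratch[programIndex] = programIndex * 2.0f;    // per-instance write (scatter)
    // Past the sequence point: all the writes above are visible to the whole gang.
    out[programIndex] = scratch[(programIndex + 1) % programCount];
}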

Adds launch and sync, with the guarantee that a function that launches won't return until its launches complete. Having this restriction allows composition, in that callers don't need to consider whether the callee is doing async launches, as all such launches must have completed by the time it returns to the caller.
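
Roughly like this, as I understand it (the chunking scheme and names are my own):

task void scale_chunk(uniform float data[], uniform int start, uniform int count) {
    foreach (i = start ... start + count) {
        data[i] *= 2.0f;
    }
}

export void scale_all(uniform float data[], uniform int n) {
    uniform int chunk = 4096;
    for (uniform int start = 0; start < n; start += chunk) {
        launch scale_chunk(data, start, min(chunk, n - start));
    }
    sync;   // explicit wait; an implicit sync also happens before scale_all returns
}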

There's some subtlety around pointers and uniform and varying. Take

float a;
varying float * uniform pa = &a;
*pa = programIndex;  // same as (a = programIndex)

pa is a uniform pointer - so all threads have the same pointer, but it's pointing to a varying float, meaning that there is a different float for each thread. You could perhaps think of this as something like

Simd<float, gangSize> a;
Simd<float, gangSize>* pa = &a;       // a single shared ("uniform") pointer
(*pa)[programIndex] = programIndex;   // same as (a = programIndex)

Seems as if you can use arithmetic on enum types(!).

Has an "unbound" concept around struct members. If uniform/varying is not defined on the member item it is unbound, and the variable will take on the varying/uniform property of the usage.

Has foreach_active; it's there to work around race conditions that could otherwise occur, by serializing operations over the active program instances. Local atomics are another option for some scenarios. It doesn't define an ordering on the execution of the active instances (that's okay in the example given, but might not be in others).
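
Roughly the sort of use it seems intended for (a toy histogram, names mine): several instances may target the same bucket, and a plain masked increment would race.

void bump_counters(uniform int counts[], int bucket) {
    foreach_active (lane) {
        // Only one program instance executes this body at a time, so the
        // read-modify-write below can't race with the other active lanes.
        uniform int b = extract(bucket, lane);
        counts[b] += 1;
    }
}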

Using assume can tell the compiler about alignment.
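
Something along these lines, I believe (the specific invariants, and the exact spelling of the pointer test, are my assumptions rather than anything from the docs):

export void halve(uniform float data[], uniform int n) {
    // Unchecked promises to the compiler: the pointer is 64-byte aligned and
    // the count is a multiple of the gang size, which can let it drop
    // unaligned-access and remainder-loop handling.
    assume(((uniform int64)data & 63) == 0);
    assume(n % programCount == 0);
    foreach (i = 0 ... n) {
        data[i] *= 0.5f;
    }
}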

Read

Questions

How is Memory Handled?

In the example, arrays are passed from a C program as uniform, and foreach is used to iterate over the items. But nothing is said about the alignment of the input, which could be computationally costly. On the other hand, by not specifying it, different SIMD widths can work transparently.
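
The pattern looks roughly like this (names assumed); the generated header exposes it to C with plain pointers:

export void add_arrays(uniform float out[], uniform float a[],
                       uniform float b[], uniform int n) {
    // Nothing here says how out/a/b are aligned; foreach also copes with an n
    // that isn't a multiple of the gang size, at some cost in edge handling.
    foreach (i = 0 ... n) {
        out[i] = a[i] + b[i];
    }
}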

SIMD Register Width and Gang Size

I guess I have worries about the gang abstraction in general. Whilst it abstracts away a specific width, that doesn't jibe too well with typical SIMD implementations: when targeting AVX or SSE you end up with a different width. You can choose larger widths than the SIMD registers - doing so could be seen as a kind of loop unrolling, but it could also allow for better use of SIMD registers.

As an example, let's take SSE. If I have ints, then it's "obvious" to use a width of 4, or some multiple of it, as SSE registers are 16 bytes. In my ISPC code I might want to use bools, or uint8_t. Let's go with uint8_t: for that processing I can fit 16 of them into a register. So should I have a gang of 16? Or 4? If a gang of 4 was used, you could (with some caveats that could expose the extra bits) just use a full uint per lane, which would work, but performance is a quarter of what it could have been. Going the other way, with a gang of 16, the uint8_ts now fit into one register, but the ints require 4 SSE registers. That increases register pressure, and raises the minimum granularity of a computation to 16 elements. The performance manual seems to hint that the preferred solution is to use 32-bit types.

Related to this issue are masks. Masks are implicitly associated with the type that produced them: an SSE mask for 8-bit types will have 16 byte-sized lanes, while for int it will have 4 lanes of 4 bytes. This means that masks can't be used interchangeably between different types; they need to be widened and narrowed as required. Perhaps this is partly why 32-bit types are preferred.

It seems that only a limited set of gang sizes is available. The gang width is associated with the compilation "target".

Links