
Pointer Generalization

Posted on: 2023-06-03

Pointers are a core feature of many programming languages. In C and C++, raw pointers boil down to hardware mechanisms. What I'm going to discuss here is not the low level concept of a pointer, but a variety of related pointer-like behaviors. It may make sense from a language point of view to make them concepts distinct from "pointers", but they seem worth discussing together as the ideas are related.

I think one key language feature I would want to see is support for many different layouts of a type being usable interchangeably, with the compiler implementing the mechanisms to access, iterate and convert, such that usage is simple and performant.

What is a pointer?

If we wanted to generalize we could perhaps say it's "an indirection". The use may be for convenience - a way to specify something in a different way. It may be a way to simplify logic, by allowing other sections of code to access some data without having to know the path to get to that specific bit of data. It might be used to share data. In sharing it also allows communication. In a single-threaded program the sharing is straightforward, as it's possible to read and write in different sections of code and have the changes be seen in a temporally consistent manner. With multiple cores this is typically no longer the case, and other mechanisms are needed to coordinate between cores such that there is consistency.

Pointers and Hardware

In this section I'll use pointer to mean what is normally a "raw" pointer in C.

With a pointer the indirection is to a memory location, the value being a number which is typically the byte offset to the data aka "the address". With typical modern processors the address is not a physical address, but a logical address that is converted to a physical address via a Memory Management Unit (MMU). The use of the MMU does mean that the same physical address could appear at multiple logical addresses. It is perhaps worth saying that logical addresses don't need to map to physical memory at all - it could be being used for memory mapped I/O, or perhaps virtual memory where a particular page is not in memory.

A pointer doesn't necessarily have to be just a logical (or perhaps physical) address. The CHERI standard provides mechanisms to store information within a pointer to allow hardware support for checking that pointer usage is valid. Interestingly, this also leads to 128 bit pointers, which are larger than the typical 64 bit word size.

A programmer might use other bits of a pointer to indicate other information.

Whilst typical CPU pointers work in isolation (you can read and write their contents without any extra data), that's not always the case.

Some processors have multiple address spaces. Two real scenarios that spring to mind are the scratchpad on the PlayStation 2 and the SPE memory of the Cell processor in the PlayStation 3. Scratchpad memory is a section of address space that is backed by faster memory, accessed via regular addressing. The SPEs had local memory that is not directly visible to the PPE; data has to be DMAed between the SPE and PPE under programmer control. This created considerable complication from a programming point of view.

GPUs have many different address spaces - thread local, group local, shared, global. Resources could be considered a kind of programmable address space.

One popular instruction set, x86_64, has pointer behavior that is somewhat of an illusion in most programming, because a logical address is actually the combination of a segment register and the pointer. It just so happens that in most scenarios the segment registers are set up to map to the same physical addresses. Whilst not exposed directly, the segment register differences are used - for thread local storage on Windows, for example.

Many of these differences are not exposed in languages developers use. If we want a systems level language that can produce performant code it would be helpful if

Composite Pointers

There are software tricks that allow pointers to be used in other ways. On the Atari ST addresses were 32 bit, but the 68000 only used 24 address bits, so some software used the remaining bits for other purposes. Tempus is one well known product that did. Something akin to this has reoccurred on 64 bit processors, where data can be stored in "unused" bits.

A less controversial trick is to use the fact that, for many types, alignment is a requirement, and this leaves the lowest few bits available for other uses. One use is as a tag identifying what kind of thing is being pointed to. Since such uses are outside the visibility of the type system, they make debugging more difficult. Whilst this is a little bit of a digression, it hopefully shows some of the breadth of "pointerness" even with "raw" pointers.

Ideally a systems level programming language would provide

Such tricks, whilst not that unusual, do have significant issues

In C++ some programming convenience and type safety can be achieved using a template. In some debuggers it is possible to write code to make visualization work, such as with natvis in Visual Studio, or via code execution within GDB.
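A minimal sketch of what such a template might look like. The name TaggedPtr and the bit count are my own illustrative choices; it assumes the pointee's alignment genuinely leaves the low bits free.

```cpp
#include <cassert>
#include <cstdint>

// Stores a small integer tag in the low bits of a suitably aligned pointer.
// TagBits must not exceed the alignment guarantee of T.
template <typename T, unsigned TagBits = 2>
class TaggedPtr
{
    static constexpr uintptr_t kMask = (uintptr_t(1) << TagBits) - 1;
    uintptr_t m_value = 0;
public:
    TaggedPtr(T* ptr, uintptr_t tag)
    {
        // The pointer's low bits must be clear, or the tag would corrupt it
        assert((reinterpret_cast<uintptr_t>(ptr) & kMask) == 0);
        assert(tag <= kMask);
        m_value = reinterpret_cast<uintptr_t>(ptr) | tag;
    }
    T* ptr() const { return reinterpret_cast<T*>(m_value & ~kMask); }
    uintptr_t tag() const { return m_value & kMask; }
};
```

A natvis entry or a GDB pretty-printer can then unpack `m_value` the same way, which is what makes the template form debuggable where a raw `uintptr_t` is not.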

Indices and Handles

Using hardware pointers can cause problems

For data orientated programming (DOD) how memory is used becomes of critical importance. Limiting the amount of memory that is required can lead to large performance increases.

Some of these issues can be addressed with relative pointers. They address the serialization issue and can allow for smaller "pointer" sizes, depending on how far away a relative pointer needs to point. Whilst this is useful, and something you'd hope a systems orientated programming language would support, it also implies all of a data structure is stored in a contiguous chunk of memory, which limits its practical uses.
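As a sketch of the idea: a hypothetical 32-bit self-relative pointer, storing an offset from the field's own address, so the structure it lives in can be serialized or moved as a whole and the "pointer" stays valid.

```cpp
#include <cassert>
#include <cstdint>

// A self-relative pointer: half the size of a 64-bit pointer, and valid
// as long as pointer and pointee move together. Illustrative only - a
// real implementation would need null handling and range checking.
template <typename T>
struct RelPtr32
{
    int32_t m_offset;

    void set(T* target)
    {
        m_offset = int32_t(reinterpret_cast<intptr_t>(target) -
                           reinterpret_cast<intptr_t>(this));
    }
    T* get() const
    {
        return reinterpret_cast<T*>(
            reinterpret_cast<intptr_t>(this) + m_offset);
    }
};
```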

So indices to the rescue! The advantages

An index doesn't work in isolation; it needs to be combined with some other structure - an array, for example - in order to access data.

An argument could be made that because an index can't be used to access data on its own, it's not really a pointer. If you read the section on hardware pointers, that distinction doesn't entirely hold, because on x86 a pointer is combined with other data - the segment register. On GPU architectures the combination of resource and index is typically your pointer. Or using "bindless" on GPUs, where the resources themselves are accessed in a pointer like manner, the pointers are indices.

Perhaps yet another way of looking at this would be that the "context" changes the meaning of a pointer. A pointer on the PPE and a pointer on an SPE don't point to the same thing; they are in different address spaces. So SPE-ness or PPE-ness could be seen as a global prefix on the pointer.

Whilst perhaps a little belabored, the point I'm trying to make is that indices and hardware pointers are actually quite similar. Systems programming languages, whilst low level, do not often make the nuance around pointers visible. Languages would benefit from having type systems that support using indices and handles in pointer like ways.

The main issues with using indices

Before digging further, let's talk a little about handles. A handle is an opaque type referring to data. In some scenarios the value contents of the handle is the data. That's interesting and useful, but is not pointer like - the handle is a value type. Sometimes it really is a pointer, just with the type information removed. This is a useful abstraction in some scenarios, but in essence most of what is applicable to pointers is applicable to such handles. There is perhaps an argument for wanting language support for some type safety around such a representation, such as automatically converting between the handle and the internal representation across an API boundary.

A combination of an index with some extra data allows for a pointer-like type that handles dangling pointers. There are many ways to do this, but the most common is to associate an index with some kind of generational count. When the item at an index is destroyed, the generational count associated with that index is increased. Any remaining handles can then detect, when used, whether the referenced item has been destroyed, by checking if the generational count stored in the handle is the same as the item's. If it isn't, the handle is no longer valid. The index can be reused because the generation value will identify which 'version' of the item is being referenced. There is a limitation here in that the number of bits for generations is typically limited, so it is possible (without some other logic) for a reference to a destroyed thing to alias the "new" thing. For many applications, for example games, this may not be an issue.
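The scheme above can be sketched roughly as follows. Pool, Handle and Slot are hypothetical names, and for brevity this pool only ever appends rather than reusing free slots.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// A handle packs the slot index with the generation it was created
// against, so stale handles can be detected at lookup time.
struct Handle { uint32_t index; uint32_t generation; };

template <typename T>
class Pool
{
    struct Slot { T item{}; uint32_t generation = 0; bool alive = false; };
    std::vector<Slot> m_slots;
public:
    Handle create(const T& item)
    {
        // Always append for simplicity; a real pool would pop a free list
        m_slots.push_back({item, 0, true});
        return {uint32_t(m_slots.size() - 1), 0};
    }
    void destroy(Handle h)
    {
        Slot& slot = m_slots[h.index];
        slot.alive = false;
        slot.generation++;   // invalidates all outstanding handles
    }
    T* get(Handle h)
    {
        Slot& slot = m_slots[h.index];
        return (slot.alive && slot.generation == h.generation)
            ? &slot.item : nullptr;
    }
};
```

The `get` path shows the extra cost mentioned later: an index lookup plus a generation compare on every dereference.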

Something akin to generational handles is possible with plain indices too, if we split indices into different states. If we can traverse all items, and their usage of other items, we can detect if they reference something destroyed, and mark those references as null. Once we do this we are then free to reuse the index. More concretely we could have

Using these rules, when we get to a certain point we can garbage collect. We go through all references; if one references something that is dead we change the reference to 0, effectively null. We then copy all of the indices in the dead list into the free list.
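A rough sketch of that collection step, under assumed conventions: index 0 is reserved as null, each item stores its references as plain indices, and destroyed indices sit in a dead list until a collection runs.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct GcItem { std::vector<uint32_t> refs; };

struct Table
{
    std::vector<GcItem> items;      // slot 0 reserved as null
    std::vector<uint32_t> deadList;
    std::vector<uint32_t> freeList;

    void collect()
    {
        // Null out any reference to a dead index...
        for (GcItem& item : items)
            for (uint32_t& ref : item.refs)
                for (uint32_t dead : deadList)
                    if (ref == dead) { ref = 0; break; }
        // ...then the dead indices become safely reusable
        freeList.insert(freeList.end(), deadList.begin(), deadList.end());
        deadList.clear();
    }
};
```

The sweep here is a naive nested loop for clarity; a real implementation would mark dead slots in a bitset so the inner scan is O(1).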

Something similar to this can be achieved with generational indices: when the generational counter hits some maximum, a garbage collection is issued. The problem with this for many applications is that such a collection could be slow and happen at somewhat arbitrary times, which is why for games it's more common to just live with the uncommon aliasing scenario.

With the generational counter mechanism, an indirection will require extracting the index, doing a lookup and then comparing the generational count. Such operations are likely fast, but it is extra overhead.

Layout

In C++ a pointer points to a type that has some consistent in-memory representation. That same representation will typically be used for any in-memory instance. If the structure is in registers, or partially in registers, that is typically a transitory state and the in-memory representation is the ground truth. If a pointer is taken to such a structure and a function is called with it, then before calling the function (unless, say, it is inlined), anything held in registers will typically have to be written back to memory.

Unfortunately memory is slow, certainly compared to registers.

The normal C++ representation may be far from optimal. Most modern processors have some kind of SIMD capability, and the C++ array-of-structs representation is typically sub optimal for it. Converting from AOS to SOA on the fly is typically costly enough to wipe out any performance benefit. What kind of performance advantage is possible with a SIMD friendly layout? Getting close to 4x is common with a typical 4 wide SIMD implementation, and it's not uncommon, for a variety of reasons, to do better.

There are a variety of problems here

When I'm talking about SIMD SOA like representations I typically mean interleaved SOA.

Leaving aside the SOA, sometimes it is advantageous to use a "SIMD aware" AOS representation. Take for example

struct Thing
{
    float3 position;
    int32_t index;
};

// A SIMD aware representation... something like

typedef float4 Thing;
Thing makeThing(float3 pos, int index) { return ...; }
float3 getPosition(const Thing& thing) { return replaceW(thing, 0); }

/// An interleaved SOA like representation. Each struct holds 4 Thing
struct Thing
{
    float4 x;
    float4 y;
    float4 z;
    int4 index;
};

// A SIMD SOA like representation, can hold any multiple of 4.
// Unless there is a special reason, interleaved SOA is
// typically better - better cache locality, and only a single pointer is required.
struct Thing
{
    float4* x;
    float4* y;
    float4* z;
    int4* index;
};

I could encode this in a single 4 wide SIMD register. Care would need to be taken around index. If the index fits in 23 bits, the value could just be stored in a valid float. Most SIMD systems allow the meaning of lanes to be reinterpreted without cost. If the full 32 bits are required, the value could be stored using the float's bits directly, and then when operations are performed on position, that lane can be masked. There may be a performance issue if the int bitcast as a float is a NaN or a denormal, so care may be necessary.
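A scalar sketch of the two storage options. The helper names are hypothetical; the point is that a memcpy bitcast preserves all 32 bits, whereas a numeric int-to-float conversion only round-trips values that fit in the mantissa.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Bit-for-bit reinterpretation, not a numeric conversion.
// The resulting float may be a NaN or denormal, so the lane must be
// masked out of any arithmetic on the float parts.
inline float intAsFloatBits(uint32_t v)
{
    float f;
    std::memcpy(&f, &v, sizeof f);
    return f;
}

inline uint32_t floatBitsAsInt(float f)
{
    uint32_t v;
    std::memcpy(&v, &f, sizeof v);
    return v;
}
```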

What would really be helpful would be to free the compiler to choose a suitable representation depending on usage and target. Ideally, being a systems programming language, it should be possible to specify a particular representation and let the compiler convert to different forms as required. In the language we could specify Thing as in the first example, but could convert to and access other representations as if each were the original. This includes support for iterating such that multiple items can be processed in parallel.

What should a pointer to Thing look like if we allow this? For an AOS layout, the pointer just points to the struct, but for the others this doesn't work. The simplest answer that always works is that a pointer is an index. There are other options

Packing

There is an issue around how many elements are packed into a fixed array "interleaved" representation. The most "obvious" answer is the SIMD width, but this leads to wasted space due to typical alignment issues with SIMD types. Take for example

struct Item
{
    bool isActive;
    float value;
};

// The "obvious" 4 way sind version
struct Item2
{
    uint8_t isActive;           /// We use a bit for each lane
    float4 value;
};

// Perhaps the isActive value is being used
// a lot and directly within processing, so you might end up with
struct Item3
{
    uint4 isActive;         ///< 0xffffffff if true, 0 if false
    float4 value;
};

Item3 might be appropriate for some scenarios if masking is the main mechanism that is going to be used. On most systems a bool is a byte; by making it a mask it's now 32 bits (!). Item2 uses the smallest type that can hold the necessary bits - a uint8_t - but unfortunately, because float4 is probably 16 byte aligned, it actually takes up the same amount of space as Item3. AVX-512 has predicate (mask) registers, where the bit representation is probably more appropriate. If we only have 4 bits, we could use a table, or do some masking to turn the bits into a SIMD register.

auto mask = broadcast<uint4>(isActive);
mask &= uint4(0x1, 0x2, 0x4, 0x8);  // lane i tests bit i
mask = (mask != 0);


// Or with a table
auto mask = g_table[isActive];

Which is faster depends on the architecture. The more bits that are needed (with wider SIMD), the less attractive the table becomes, as its size grows exponentially with the bit count.
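The table variant might be built something like this. Here uint4 is a plain placeholder struct standing in for a real SIMD type, and MaskTable stands in for the g_table above - both are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

struct uint4 { uint32_t lane[4]; };

// 16 entries mapping a 4-bit isActive value to a 4-lane mask,
// with lane i driven by bit i.
struct MaskTable
{
    uint4 entries[16];
    MaskTable()
    {
        for (uint32_t bits = 0; bits < 16; ++bits)
            for (uint32_t i = 0; i < 4; ++i)
                entries[bits].lane[i] = (bits & (1u << i)) ? 0xffffffffu : 0;
    }
};
```

At 16 entries * 16 bytes the table fits in a few cache lines; at 8 bits (256 entries of 32 bytes for 8-wide SIMD) it is already far less appealing, which is the exponential growth mentioned above.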

To address all this waste, it seems reasonable to expand the amount of items. We can have code act as if it has a wider SIMD width.

struct Item4
{
    uint8_t isActive;
    float4 value[2];
};

Now all bits of isActive are used. Item4 becomes 48 bytes instead of 32, but now it holds 8 elements. The more we widen, the less waste due to padding there is - from 32/4 = 8 bytes per item to 48/8 = 6 bytes.
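The padding arithmetic can be checked directly, assuming float4 behaves like a typical 16-byte-aligned SIMD type (stood in for here by an alignas struct).

```cpp
#include <cassert>
#include <cstdint>

struct alignas(16) float4 { float v[4]; };

struct Item2 { uint8_t isActive; float4 value;    };   // holds 4 items
struct Item4 { uint8_t isActive; float4 value[2]; };   // holds 8 items

// 1 byte of flags + 15 bytes of padding + the value array
static_assert(sizeof(Item2) == 32, "8 bytes per item");
static_assert(sizeof(Item4) == 48, "6 bytes per item");
```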

There are several issues

On the last point the divide might not be too terrible, because the divisor will generally be known at compile time and most modern C/C++ compilers use a multiply trick to avoid a divide. We also need the remainder - so I think that will probably require another multiply. Whilst that is not as bad as a divide, it's still considerably worse than a shift and a mask. When iterating through items the index conversion can be avoided, so depending on access patterns it may not be too bad.

In general though it is desirable for the amount of items held to be

Additionally we may want to rearrange elements to make packing less wasteful. This is true generally speaking and some newer languages make such rearrangement the default. Zig springs to mind.

For example if we had

struct AnotherItem
{
    bool isActive;
    float value;
    bool isGood;
};

// On a CPU (unless required for some ABI/other reason) we'd be 
// better off if this was

struct AnotherItem2
{
    bool isActive;
    bool isGood;
    float value;
};

// And similarly when making interleaved SOA, we could do

struct AnotherItem3
{
    uint8_t isActive;
    uint8_t isGood;
    float4 value;
};

Once we have divided the index by the amount of items in a 'block', to actually access the element we need to go to the start of the block that holds it, which is typically a fast (perhaps shift and add) multiply, and then the remainder is used to look up the actual item within the block.
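That lookup might be sketched as follows. Block and getX are illustrative; because ElementCount is a compile-time constant, the compiler can strength-reduce the divide and remainder (to a shift and mask here, since 4 is a power of two).

```cpp
#include <cassert>
#include <cstdint>

// An interleaved SOA block of 4 elements
struct alignas(16) Block
{
    static constexpr uint32_t ElementCount = 4;
    float x[ElementCount];
    float y[ElementCount];
};

inline float getX(const Block* blocks, uint32_t index)
{
    uint32_t blockIndex = index / Block::ElementCount;  // which block
    uint32_t subIndex   = index % Block::ElementCount;  // which lane
    return blocks[blockIndex].x[subIndex];
}
```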

Partial

When we use interleaved SOA, we pack together some fixed number of elements, which I'm going to call a "block" for want of a better name. Typically though, the number of elements in an array is not going to be a multiple of the block size. If we make index 0 the start of a block, we only need to worry about partial blocks at the end of the list. We can't always have the start be index 0 like this though - for example when we have a slice. How to deal with this will be discussed in the semantics section.

Even if index 0 is the start of the first block, if we want to operate on sub section, we may have one of

If we had a mask to control what lanes are operated on, we could have two versions of the "work" of the loop: one that does a whole block and one that does a block with a mask. Whilst that works, it may also be the case that for very small sub sections it is actually more efficient to extract the items and do the work in a scalar fashion.

Operating on a whole block with masking doesn't seem trivial. If there is some operation where a result is written back, we will need to merge based on the mask. If we have some kind of reduction (sum/max/min etc), then we need a version of that operation that honors the mask. How the mask works will be dependent on the operation. If we want to produce an index, that will also need to take into account the operation and the mask.
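As a small illustration of the reduction point, here is a scalar sketch of a sum over a 4-lane block that honors a lane mask. Inactive lanes must contribute the operation's identity - 0 for a sum, +infinity for a min, and so on - which is why the mask handling is operation-specific.

```cpp
#include <cassert>
#include <cstdint>

// Sum the active lanes of a 4-wide block; bit i of mask selects lane i.
// Skipped lanes contribute 0, the identity for addition.
inline float maskedSum4(const float value[4], uint8_t mask)
{
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i)
        if (mask & (1u << i))
            sum += value[i];
    return sum;
}
```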

Semantics

Pointers can be used in many different ways. In C++ we also have references which, whilst not quite the same as pointers in a variety of ways, share much of what follows. A reference can crudely be thought of as a non assignable, non nullable pointer. This isn't quite right, because a const reference can be treated as a value.

Using pointers to make some implementation opaque is useful, just as object orientated programming remains useful. Using reference counting is a reasonable lowest common denominator approach that is simple and effective. Such mechanisms are not useful on all targets, but on CPU like targets they do the job. Such an abstraction is perhaps somewhat heavyweight, requiring virtual function calls and typically atomic reference counting. For targets and scenarios where these aren't a problem it's a good solution, and something I won't touch on more here.

A good systems programming language will allow the use of a wide variety of abstractions. It will also allow going down close to the metal. So let's explore some common scenarios, in the context of one of many items that is indexable. The distinction is made because, if there is just one item, it can just be a single struct.

It's perhaps worth adding that in/out and inout have specific semantics - they are not the same as a pointer/reference in general. The behavior makes the intent clearer in code, as well as allowing a compiler to check for aliasing scenarios and apply more optimizations, because the semantics typically mean there is no aliasing that could cause problems over, say, reading and writing.

Single Item

We have a few scenarios

  1. Hold the pointer to the start block, and the index from the start
  2. Hold the pointer to the element block and a sub block index
  3. Have a pointer contain both the pointer to the element block and the sub block index aka composite pointer
  4. Pointer to the start block, the block index, and the sub block index

2 has the advantage over 1 that we don't need to calculate the offset to the start of the block that contains the item. The cost of that calculation depends on the cost of the divide/remainder by the amount of elements in a block, plus the cost of doing the address calculation for the block. If the amount of elements is a power of 2, the cost is fairly low but probably not negligible.

3 only works if there are enough bits available to store the sub block index within the pointer. It's perhaps not a bad solution, in that an item can be passed as a single pointer.

4 is a kind of compromise, in that it doesn't require the full index-to-block-index calculation, but still requires offsetting.

I suppose it might be desirable to compare pointers for ordering. If we assume there is no overlap they are all relatively fast.

bool isLess1(Ptr a, Ptr b)
{
    return a.start < b.start || 
        (a.start == b.start && a.index < b.index);
}

bool isLess2(Ptr a, Ptr b)
{
    return a.block < b.block || 
        (a.block == b.block && a.subIndex < b.subIndex);
}
bool isLess3(Ptr a, Ptr b)
{
    return a < b;
}

bool isLess4(Ptr a, Ptr b)
{
    return a.start < b.start || 
        (a.start == b.start && a.blockIndex < b.blockIndex) || 
            (a.start == b.start && a.blockIndex == b.blockIndex && 
                a.subIndex < b.subIndex);
}

I think it's reasonable to rule out 4 just because of the extra complexity.

All things considered, 2 and perhaps 3 seem the most attractive. The pointer comparison (and equality?) behavior perhaps favors 3, but it only works if we can represent the block and the subIndex in a single pointer. It would be fast to convert 2 into a pointer (or a pointer sized int); again that only works if we can combine them. It's also worth saying that using 3, any access will require some extra ops to extract the block and the subIndex, which isn't necessary with 2.

On the other hand with 2, the pointer cannot be held in a single register, and the subIndex field only needs very few bits, so it seems somewhat wasteful.

Slice

A slice specifies a range of items, again we have a few choices.

  1. Start and end single pointers (however represented)
  2. A start block with a start and end index
  3. Start element block, and sub index and a length

For a slice we require that the start points to an element in a block, but the end can point to the start of the next block.

As well as being able to iterate over items, it is typically important to be able to quickly get a count. We can calculate the count trivially for 2 and 3. For 1, if the block sizes are not a power of 2, it could be considerably more costly. Something like

// Block of some amount of elements in SIMD form
struct Block
{
    enum { ElementCount = ... };
};

Count getCount1(Slice slice)
{
    auto startBlock = slice.start.block;
    auto endBlock = slice.end.block;
    if (startBlock == endBlock)
    {
        return slice.end.subIndex - slice.start.subIndex;
    }
    // The block pointer difference probably requires a divide...
    return (endBlock - startBlock) * Block::ElementCount -
        slice.start.subIndex + slice.end.subIndex;
}

Count getCount2(Slice slice)
{
    return slice.end - slice.start;
}

Count getCount3(Slice slice) { return slice.count; }

Other operations might be

With 2, this is trivial. With 3 it depends, but it will require some work to calculate the block and subIndex; this seems reasonable enough, and would be required for any access anyway.

With [], 2 is trivial. 3 is close to the same, except we add the start subIndex and do the lookup as if the block pointer is the start.

A nice-ish property of 3 is that it is in effect the same as the single pointer option plus a length, so the two can trivially be converted between.

So probably 3 is preferable.

Pointer arithmetic

The most common pointer arithmetic operations are comparisons, ++, --, +=, -=

Let's assume the pointer that can be used for arithmetic is the same as a single pointer (in terms of representation - we may want a different syntax).

If we assume 2, then ++, += or --, -= require checking, and possibly changing, the block pointer. If the amount of elements held in a block is a power of 2, this is somewhat easy.

Ptr::operator-=(in out Ptr ptr, Index delta)
{
    Index index = ptr.subIndex - delta;
    ptr.block += (index >> shift);      // arithmetic shift handles the borrow
    ptr.subIndex = (index & mask);
}

I suppose there is an issue around subIndex in this encoding only needing a very small amount of bits.
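For what it's worth, a concrete C++ version of the decrement can be tested directly. kShift/kMask and the Block contents are assumptions; note the intermediate index must be a signed type so the arithmetic right shift carries the borrow into the block pointer.

```cpp
#include <cassert>
#include <cstdint>

struct Block { float x[4]; };   // 4 elements per block

struct Ptr
{
    Block*  block;
    int32_t subIndex;
};

constexpr int32_t kShift = 2;               // log2(elements per block)
constexpr int32_t kMask  = (1 << kShift) - 1;

inline void sub(Ptr& ptr, int32_t delta)
{
    int32_t index = ptr.subIndex - delta;   // may go negative
    ptr.block    += (index >> kShift);      // arithmetic shift: borrow propagates
    ptr.subIndex  = (index & kMask);
}
```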

Bounds checking

(TODO)

GPUs and CPUs

The discussion here so far has largely been around SIMD and CPUs. GPUs have the convenient property that many of these issues are handled invisibly via the languages used and in hardware. So what does this look like from a programming language model point of view? What would equivalent code generated for a GPU look like?

(TODO)

Other ideas

Combinatorial explosion of Scenarios

One aspect of this discussion is that there are many ways to get some code base to work effectively, but how to do that changes depending on the hardware of the target.

There are several ways this might be approached

  1. Have a language that represents the low level behavior and allow building of abstractions
  2. Have a compiler that understands the different targets
  3. Have mechanisms to "markup" declarations that say what is desired for a target

All are to a degree applicable.

C++

C++ is most like 1, and it is possible to create some of the kinds of transformations described here through template metaprogramming. It is also possible to use the preprocessor, or even switch out different implementations depending on the target.

A C++ compiler is designed to compile efficient code for a particular target. There are switches to control optimizations and assumptions. This works at the level of the C++ abstract machine, which doesn't closely fit the kinds of hardware that are available - it doesn't closely fit SIMD or GPUs, for example.

It is possible to provide some markup around conditionals, or to say whether inlining of a function is desired. The problem is the programmer often has to guess as to whether such markup is suitable. On some targets inlining might be best; on others not inlining is best. So how do you mark up your function?

Describing a transformation in code

Many of the situations I have been describing here are low level, but not lower level than (say) C/C++. It's as if there is this level above the low level where lots of interesting things could be done, but there is no way to describe them other than to write them out explicitly, making them target centric.

At this level much of the abstraction of the machine is defined. Templates work at this level, but templates are not a great way to program. They are not powerful enough, and require clunky tricks to do often simple things.

What would be ideal would be some way of describing in some form of the language these kinds of transformations. Then provide other mechanisms to describe how to apply them.

It would be a goal that in order to produce good code for a new target it wouldn't be necessary to alter the compiler in any way, only to alter these descriptions and how they are applied.

The big part that is missing here is how to make such descriptions and where and how are they defined.

Separating Target Logic

On 3, there is some friction around having markup at the usage sites. For example, imagine there are 10 targets that all require different tweaks to perform well with a type. Do I really want to define all these tweaks with the type? That is going to get in the way of the type definition, which most of the time is what you care about. One could imagine a few improvements

Simulating SIMT on SIMD

(TODO)

Links