
Pointer Generalization

Posted on: 2023-06-03

Pointers are a core feature of many programming languages. In C and C++, raw pointers boil down to hardware mechanisms. What I'm going to discuss here is not the low level concept of a pointer, but a variety of related pointer-like behaviors. It may make sense from a language point of view to make them concepts distinct from "pointers", but they seem worth discussing together as the ideas are related.

I think one key language feature I would want to see is support for many different layouts of a type being usable interchangeably, with the compiler implementing the mechanisms to access, iterate and convert, such that usage is simple and performant.

What is a pointer?

If we wanted to generalize we could perhaps say it's "an indirection". The use may be for convenience - a way to specify something in a different way. It may be a way to simplify logic, by allowing other sections of code to access some data without having to know the path to get to that specific bit of data. It might be used to share data. In sharing it also allows communication. In a single-threaded program the sharing is straightforward, as it's possible to read and write in different sections of code and have the changes be seen in a temporally consistent manner. With multiple cores this is typically no longer the case, and other mechanisms are needed to coordinate between cores such that there is consistency.

Pointers and Hardware

In this section I'll use pointer to mean what is normally a "raw" pointer in C.

With a pointer the indirection is to a memory location, the value being a number which is typically the byte offset to the data aka "the address". With typical modern processors the address is not a physical address, but a logical address that is converted to a physical address via a Memory Management Unit (MMU). The use of the MMU does mean that the same physical address could appear at multiple logical addresses. It is perhaps worth saying that logical addresses don't need to map to physical memory at all - it could be being used for memory mapped I/O, or perhaps virtual memory where a particular page is not in memory.

A pointer doesn't necessarily have to be just a logical (or perhaps physical) address. The CHERI standard provides mechanisms to store information within a pointer to allow hardware support for checking that pointer usage is valid. Interestingly, this also leads to 128 bit pointers, which are larger than the typical 64 bit word size.

A programmer might use other bits of a pointer to indicate other information.

Whilst typical CPU pointers work in isolation (you can read and write their contents without any extra data), that's not always the case.

Some processors have multiple address spaces. Two real scenarios that spring to mind are the scratchpad on the PlayStation 2 and the SPE memory of the Cell processor in the PlayStation 3. Scratchpad memory is a section of address space that is backed by faster memory, accessed via regular addressing. The SPEs had local memory that is not directly visible to the PPE; data has to be DMAed between the SPE and PPE under programmer control. This created considerable complication from a programming point of view.

GPUs have many different address spaces - thread local, group local, shared, global. Resources could be considered a kind of programmable address space.

One popular instruction set, x86_64, has pointer behavior that is somewhat of an illusion in most programming, because a logical address is actually the combination of a segment register and the pointer. It just so happens that in most scenarios the segment registers are set up to map to the same physical addresses. Whilst not exposed directly, the segment register differences are used - for thread local storage on Windows, for example.

Many of these differences are not exposed in languages developers use. If we want a systems level language that can produce performant code it would be helpful if

Composite Pointers

There are software tricks that allow pointers to be used in other ways. On the Atari ST addresses were 32 bit, but the 68000 only used 24 address bits, so some software used the remaining bits for other purposes. Tempus is one well known product that did. Something akin to this has reoccurred on 64 bit processors, where data can be stored in "unused" bits.

A less controversial trick is to use the fact that, for many types, alignment is a requirement, and this leaves the lowest few bits available for other uses. One use is as a tag identifying what kind of thing is being pointed to. Since such uses are outside the visibility of the type system, they make debugging more difficult. Whilst this is a little bit of a digression, it hopefully shows some of the breadth of "pointerness" even with "raw" pointers.

Ideally a systems level programming language would provide

Such tricks, whilst not that unusual, do have significant issues

In C++ some programming convenience and type safety can be achieved using a template. In some debuggers it is possible to write code to make visualization work, such as with natvis in Visual Studio, or via code execution within GDB.
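A minimal sketch of what such a template might look like. The name TaggedPtr and the bit count are my own illustrative choices; it assumes the pointee's alignment genuinely leaves the low bits free.

```cpp
#include <cassert>
#include <cstdint>

// Stores a small integer tag in the low bits of a suitably aligned pointer.
// TagBits must not exceed the alignment guarantee of T.
template <typename T, unsigned TagBits = 2>
class TaggedPtr
{
    static constexpr uintptr_t kMask = (uintptr_t(1) << TagBits) - 1;
    uintptr_t m_value = 0;
public:
    TaggedPtr(T* ptr, uintptr_t tag)
    {
        // The pointer's low bits must be clear, or the tag would corrupt it
        assert((reinterpret_cast<uintptr_t>(ptr) & kMask) == 0);
        assert(tag <= kMask);
        m_value = reinterpret_cast<uintptr_t>(ptr) | tag;
    }
    T* ptr() const { return reinterpret_cast<T*>(m_value & ~kMask); }
    uintptr_t tag() const { return m_value & kMask; }
};
```

A natvis entry or a GDB pretty-printer can then unpack `m_value` the same way, which is what makes the template form debuggable where a raw `uintptr_t` is not.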

Indices and Handles

Using hardware pointers can cause problems

For data orientated programming (DOD) how memory is used becomes of critical importance. Limiting the amount of memory that is required can lead to large performance increases.

Some of these issues can be addressed with relative pointers. They address the serialization issue and can allow for smaller "pointer" sizes, depending on how far away a relative pointer needs to point. Whilst this is useful, and something you'd hope a systems orientated programming language would support, it also implies all of a data structure is stored in a contiguous chunk of memory, which limits its practical uses.
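As a sketch of the idea: a hypothetical 32-bit self-relative pointer, storing an offset from the field's own address, so the structure it lives in can be serialized or moved as a whole and the "pointer" stays valid.

```cpp
#include <cassert>
#include <cstdint>

// A self-relative pointer: half the size of a 64-bit pointer, and valid
// as long as pointer and pointee move together. Illustrative only - a
// real implementation would need null handling and range checking.
template <typename T>
struct RelPtr32
{
    int32_t m_offset;

    void set(T* target)
    {
        m_offset = int32_t(reinterpret_cast<intptr_t>(target) -
                           reinterpret_cast<intptr_t>(this));
    }
    T* get() const
    {
        return reinterpret_cast<T*>(
            reinterpret_cast<intptr_t>(this) + m_offset);
    }
};
```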

So indices to the rescue! The advantages

An index doesn't work in isolation; it needs to be combined with some other structure - an array, for example - in order to access data.

An argument could be made that because an index can't be used to access data on its own, it's not really a pointer. If you read the section on hardware pointers, that distinction doesn't entirely hold, because on x86 a pointer is combined with other data - the segment register. On GPU architectures the combination of resource and index is typically your pointer. Or using "bindless" on GPUs, where the resources themselves are accessed in a pointer like manner, the pointers are indices.

Perhaps yet another way of looking at this would be that the "context" changes the meaning of a pointer. A pointer on the PPE and a pointer on an SPE don't point to the same thing; they are in different address spaces. So SPE-ness or PPE-ness could be seen as a global prefix on the pointer.

Whilst perhaps a little belabored, the point I'm trying to make is that indices and hardware pointers are actually quite similar. Systems programming languages, whilst low level, do not often make the nuance around pointers visible. Languages would benefit from having type systems that support using indices and handles in pointer like ways.

The main issues with using indices

Before digging further, let's talk a little about handles. A handle is an opaque type referring to data. In some scenarios the value contents of the handle is the data. That's interesting and useful, but is not pointer like - the handle is a value type. Sometimes it really is a pointer, just with the type information removed. This is a useful abstraction in some scenarios, but in essence most of what is applicable to pointers is applicable to such handles. There is perhaps an argument for wanting language support for some type safety around such a representation, such as automatically converting between the handle and the internal representation across an API boundary.

A combination of an index with some extra data allows for a pointer-like type that handles dangling pointers. There are many ways to do this, but the most common is to associate an index with some kind of generational count. When the item at an index is destroyed, the generational count associated with that index is increased. Any remaining handles can then detect, when used, whether the referenced item has been destroyed, by checking if the generational count stored in the handle is the same as the item's. If it isn't, the handle is no longer valid. The index can be reused because the generation value will identify which 'version' of the item is being referenced. There is a limitation here in that the number of bits for generations is typically limited, so it is possible (without some other logic) for a reference to a destroyed thing to alias the "new" thing. For many applications, for example games, this may not be an issue.
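The scheme above can be sketched roughly as follows. Pool, Handle and Slot are hypothetical names, and for brevity this pool only ever appends rather than reusing free slots.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// A handle packs the slot index with the generation it was created
// against, so stale handles can be detected at lookup time.
struct Handle { uint32_t index; uint32_t generation; };

template <typename T>
class Pool
{
    struct Slot { T item{}; uint32_t generation = 0; bool alive = false; };
    std::vector<Slot> m_slots;
public:
    Handle create(const T& item)
    {
        // Always append for simplicity; a real pool would pop a free list
        m_slots.push_back({item, 0, true});
        return {uint32_t(m_slots.size() - 1), 0};
    }
    void destroy(Handle h)
    {
        Slot& slot = m_slots[h.index];
        slot.alive = false;
        slot.generation++;   // invalidates all outstanding handles
    }
    T* get(Handle h)
    {
        Slot& slot = m_slots[h.index];
        return (slot.alive && slot.generation == h.generation)
            ? &slot.item : nullptr;
    }
};
```

The `get` path shows the extra cost mentioned later: an index lookup plus a generation compare on every dereference.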

Something akin to generational handles is possible with plain indices too, if we split indices into different states. If we can traverse all items, and their usage of other items, we can detect if they reference something destroyed, and mark those references as null. Once we do this we are then free to reuse the index. More concretely we could have

Using these rules, when we get to a certain point we can garbage collect. We go through all references; if one references something that is dead we change the reference to 0, effectively null. We then copy all of the indices in the dead list into the free list.
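A rough sketch of that collection step, under assumed conventions: index 0 is reserved as null, each item stores its references as plain indices, and destroyed indices sit in a dead list until a collection runs.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct GcItem { std::vector<uint32_t> refs; };

struct Table
{
    std::vector<GcItem> items;      // slot 0 reserved as null
    std::vector<uint32_t> deadList;
    std::vector<uint32_t> freeList;

    void collect()
    {
        // Null out any reference to a dead index...
        for (GcItem& item : items)
            for (uint32_t& ref : item.refs)
                for (uint32_t dead : deadList)
                    if (ref == dead) { ref = 0; break; }
        // ...then the dead indices become safely reusable
        freeList.insert(freeList.end(), deadList.begin(), deadList.end());
        deadList.clear();
    }
};
```

The sweep here is a naive nested loop for clarity; a real implementation would mark dead slots in a bitset so the inner scan is O(1).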

Something similar to this can be achieved with generational indices: when the generational counter hits some maximum, a garbage collection is issued. The problem with this for many applications is that such a collection could be slow and happen at somewhat arbitrary times, which is why for games it's more common to just live with the uncommon aliasing scenario.

With the generational counter mechanism, an indirection will require extracting the index, doing a lookup and then comparing the generational count. Such operations are likely fast, but it is extra overhead.

Layout

In C++ a pointer points to a type that has some consistent in-memory representation. That same representation will typically be used for any in-memory instance. If the structure is in registers, or partially in registers, that is typically a transitory state and the in-memory representation is the ground truth. If a pointer is taken to such a structure and a function is called with it, then before calling the function (unless, say, it is inlined), anything held in registers will typically have to be written back to memory.

Unfortunately memory is slow, certainly compared to registers.

The normal C++ representation may be far from optimal. Most modern processors have some kind of SIMD capability, and the C++ array-of-structs representation is typically sub optimal for it. Converting from AOS to SOA on the fly is typically costly enough to wipe out any performance benefit. What kind of performance advantage is possible with a SIMD friendly layout? Getting close to 4x is common with a typical 4 wide SIMD implementation, and it's not uncommon, for a variety of reasons, to do better.

There are a variety of problems here

When I'm talking about SIMD SOA like representations I typically mean interleaved SOA.

Leaving aside the SOA, sometimes it is advantageous to use a "SIMD aware" AOS representation. Take for example

struct Thing
{
    float3 position;
    int32_t index;
};

// A SIMD aware representation... something like

typedef float4 Thing;
Thing makeThing(float3 pos, int index) { return ...; }
float3 getPosition(const Thing& thing) { return replaceW(thing, 0); }

/// An interleaved SOA like representation. Each struct holds 4 Thing
struct Thing
{
    float4 x;
    float4 y;
    float4 z;
    int4 index;
};

// A SIMD SOA like representation, can hold any multiple of 4.
// Unless there is a special reason, interleaved SOA is
// typically better - better cache locality, and only a single pointer is required.
struct Thing
{
    float4* x;
    float4* y;
    float4* z;
    int4* index;
};

I could encode this in a single 4 wide SIMD register. Care would need to be taken around index. If the index fits in 23 bits, the value could just be stored in a valid float. Most SIMD systems allow the meaning of lanes to be reinterpreted without cost. If the full 32 bits are required, the value could be stored using the float's bits directly, and then when operations are performed on position, that lane can be masked. There may be a performance issue if the int bitcast as a float is a NaN or a denormal, so care may be necessary.
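A scalar sketch of the two storage options. The helper names are hypothetical; the point is that a memcpy bitcast preserves all 32 bits, whereas a numeric int-to-float conversion only round-trips values that fit in the mantissa.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Bit-for-bit reinterpretation, not a numeric conversion.
// The resulting float may be a NaN or denormal, so the lane must be
// masked out of any arithmetic on the float parts.
inline float intAsFloatBits(uint32_t v)
{
    float f;
    std::memcpy(&f, &v, sizeof f);
    return f;
}

inline uint32_t floatBitsAsInt(float f)
{
    uint32_t v;
    std::memcpy(&v, &f, sizeof v);
    return v;
}
```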

What would really be helpful would be to free the compiler to choose a suitable representation depending on usage and target. Ideally, being a systems programming language, it should be possible to specify a particular representation and let the compiler convert to different forms as required. In the language we could specify Thing as in the first example, but could convert to and access other representations as if each were the original. This includes support for iterating such that multiple items can be processed in parallel.

What should a pointer to Thing look like if we allow this? For an AOS layout, the pointer just points to the struct, but for the others this doesn't work. The simplest answer that always works is that a pointer is an index. There are other options

Packing

There is an issue around how many elements are packed into a fixed array "interleaved" representation. The most "obvious" answer is the SIMD width, but this leads to wasted space due to typical alignment issues with SIMD types. Take for example

struct Item
{
    bool isActive;
    float value;
};

// The "obvious" 4 way sind version
struct Item2
{
    uint8_t isActive;           /// We use a bit for each lane
    float4 value;
};

// Perhaps the isActive value is being used
// a lot and directly within processing, so you might end up with
struct Item3
{
    uint4 isActive;         ///< 0xffffffff if true, 0 if false
    float4 value;
};

Item3 might be appropriate for some scenarios if masking is the main mechanism that is going to be used. On most systems a bool is a byte; by making it a mask it's now 32 bits (!). Item2 uses the smallest type that can hold the necessary bits - a uint8_t - but unfortunately, because float4 is probably 16 byte aligned, it actually takes up the same amount of space as Item3. AVX-512 has predicate (mask) registers, where the bit representation is probably more appropriate. If we only have 4 bits, we could use a table, or do some masking to turn the bits into a SIMD register.

auto mask = broadcast<uint4>(isActive);
mask &= uint4(0x1, 0x2, 0x4, 0x8);  // lane i tests bit i
mask = (mask != 0);


// Or with a table
auto mask = g_table[isActive];

Which is faster depends on the architecture. The more bits that are needed (with wider SIMD), the less attractive the table becomes, as its size grows exponentially with the bit count.
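The table variant might be built something like this. Here uint4 is a plain placeholder struct standing in for a real SIMD type, and MaskTable stands in for the g_table above - both are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

struct uint4 { uint32_t lane[4]; };

// 16 entries mapping a 4-bit isActive value to a 4-lane mask,
// with lane i driven by bit i.
struct MaskTable
{
    uint4 entries[16];
    MaskTable()
    {
        for (uint32_t bits = 0; bits < 16; ++bits)
            for (uint32_t i = 0; i < 4; ++i)
                entries[bits].lane[i] = (bits & (1u << i)) ? 0xffffffffu : 0;
    }
};
```

At 16 entries * 16 bytes the table fits in a few cache lines; at 8 bits (256 entries of 32 bytes for 8-wide SIMD) it is already far less appealing, which is the exponential growth mentioned above.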

To address all this waste, it seems reasonable to expand the amount of items. We can have code act as if it has a wider SIMD width.

struct Item4
{
    uint8_t isActive;
    float4 value[2];
};

Now all bits of isActive are used. Item4 becomes 48 bytes instead of 32, but now it holds 8 elements. The more we widen, the less waste due to padding there is - from 32/4 = 8 bytes per item to 48/8 = 6 bytes.
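The padding arithmetic can be checked directly, assuming float4 behaves like a typical 16-byte-aligned SIMD type (stood in for here by an alignas struct).

```cpp
#include <cassert>
#include <cstdint>

struct alignas(16) float4 { float v[4]; };

struct Item2 { uint8_t isActive; float4 value;    };   // holds 4 items
struct Item4 { uint8_t isActive; float4 value[2]; };   // holds 8 items

// 1 byte of flags + 15 bytes of padding + the value array
static_assert(sizeof(Item2) == 32, "8 bytes per item");
static_assert(sizeof(Item4) == 48, "6 bytes per item");
```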

There are several issues

On the last point the divide might not be too terrible, because the divisor will generally be known at compile time and most modern C/C++ compilers use a multiply trick to avoid a divide. We also need the remainder - so I think that will probably require another multiply. Whilst that is not as bad as a divide, it's still considerably worse than a shift and a mask. When iterating through items the index conversion can be avoided, so depending on access patterns it may not be too bad.

In general though it is desirable for the amount of items held to be

Additionally we may want to rearrange elements to make packing less wasteful. This is true generally speaking and some newer languages make such rearrangement the default. Zig springs to mind.

For example if we had

struct AnotherItem
{
    bool isActive;
    float value;
    bool isGood;
};

// On a CPU (unless required for some ABI/other reason) we'd be 
// better off if this was

struct AnotherItem2
{
    bool isActive;
    bool isGood;
    float value;
};

// And similarly when making interleaved SOA, we could do

struct AnotherItem3
{
    uint8_t isActive;
    uint8_t isGood;
    float4 value;
};

Once we have divided the index by the amount of items in a 'block', to actually access the element we need to go to the start of the block that holds it, which is typically a fast (perhaps shift and add) multiply, and then the remainder is used to look up the actual item within the block.
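That lookup might be sketched as follows. Block and getX are illustrative; because ElementCount is a compile-time constant, the compiler can strength-reduce the divide and remainder (to a shift and mask here, since 4 is a power of two).

```cpp
#include <cassert>
#include <cstdint>

// An interleaved SOA block of 4 elements
struct alignas(16) Block
{
    static constexpr uint32_t ElementCount = 4;
    float x[ElementCount];
    float y[ElementCount];
};

inline float getX(const Block* blocks, uint32_t index)
{
    uint32_t blockIndex = index / Block::ElementCount;  // which block
    uint32_t subIndex   = index % Block::ElementCount;  // which lane
    return blocks[blockIndex].x[subIndex];
}
```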

Partial

When we use interleaved SOA, we pack together some fixed number of elements, which I'm going to call a "block" for want of a better name. Typically though, the number of elements in an array is not going to be a multiple of the block size. If we make index 0 the start of a block, we only need to worry about partial blocks at the end of the list. We can't always have the start be index 0 like this though - for example when we have a slice. How to deal with this will be discussed in the semantics section.

Even if index 0 is the start of the first block, if we want to operate on sub section, we may have one of

If we had a mask to control what lanes are operated on, we could have two versions of the "work" of the loop: one that does a whole block and one that does a block with a mask. Whilst that works, it may also be the case that for very small sub sections it is actually more efficient to extract the items and do the work in a scalar fashion.

Operating on a whole block with masking doesn't seem trivial. If there is some operation where a result is written back, we will need to merge based on the mask. If we have some kind of reduction (sum/max/min etc), then we need a version of that operation that honors the mask. How the mask works will be dependent on the operation. If we want to produce an index, that will also need to take into account the operation and the mask.
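As a small illustration of the reduction point, here is a scalar sketch of a sum over a 4-lane block that honors a lane mask. Inactive lanes must contribute the operation's identity - 0 for a sum, +infinity for a min, and so on - which is why the mask handling is operation-specific.

```cpp
#include <cassert>
#include <cstdint>

// Sum the active lanes of a 4-wide block; bit i of mask selects lane i.
// Skipped lanes contribute 0, the identity for addition.
inline float maskedSum4(const float value[4], uint8_t mask)
{
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i)
        if (mask & (1u << i))
            sum += value[i];
    return sum;
}
```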

Semantics

Pointers can be used in many different ways. In C++ we also have references which, whilst not quite the same as pointers in a variety of ways, share much of what follows. A reference can crudely be thought of as a non assignable, non nullable pointer. This isn't quite right, because a const reference can be treated as a value.

Using pointers to make some implementation opaque is useful, just as object orientated programming remains useful. Using reference counting is a reasonable lowest common denominator approach that is simple and effective. Such mechanisms are not useful on all targets, but on CPU like targets they do the job. Such an abstraction is perhaps somewhat heavyweight, requiring virtual function calls and typically atomic reference counting. For targets and scenarios where these aren't a problem it's a good solution, and something I won't touch on more here.

A good systems programming language will allow the use of a wide variety of abstractions. It will also allow going down close to the metal. So let's explore some common scenarios, in the context of one of many items that is indexable. The distinction is made because, if there is just one item, it can just be a single struct.

It's perhaps worth adding that in/out and inout have specific semantics - they are not the same as a pointer/reference in general. The behavior makes the intent clearer in code, as well as allowing a compiler to check for aliasing scenarios and apply more optimizations, because the semantics typically mean there is no aliasing that could cause problems over, say, reading and writing.

Single Item

We have a few scenarios

  1. Hold the pointer to the start block, and the index from the start
  2. Hold the pointer to the element block and a sub block index
  3. Have a pointer contain both the pointer to the element block and the sub block index aka composite pointer
  4. Pointer to the start block, the block index, and the sub block index

2 has the advantage over 1 that we don't need to calculate the offset to the start of the block that contains the item. The cost of that calculation depends on the cost of the divide/remainder by the amount of elements in a block, plus the cost of doing the address calculation for the block. If the amount of elements is a power of 2, the cost is fairly low but probably not negligible.

3 only works if there are enough bits available to store the sub block index within the pointer. It's perhaps not a bad solution, in that an item can be passed as a single pointer.

4 is a kind of compromise, in that it doesn't require the full index-to-block-index calculation, but still requires offsetting.

I suppose it might be desirable to compare pointers for ordering. If we assume there is no overlap they are all relatively fast.

bool isLess1(Ptr a, Ptr b)
{
    return a.start < b.start || 
        (a.start == b.start && a.index < b.index);
}

bool isLess2(Ptr a, Ptr b)
{
    return a.block < b.block || 
        (a.block == b.block && a.subIndex < b.subIndex);
}
bool isLess3(Ptr a, Ptr b)
{
    return a < b;
}

bool isLess4(Ptr a, Ptr b)
{
    return a.start < b.start || 
        (a.start == b.start && a.blockIndex < b.blockIndex) || 
            (a.start == b.start && a.blockIndex == b.blockIndex && 
                a.subIndex < b.subIndex);
}

I think it's reasonable to rule out 4 just because of the extra complexity.

All things considered, 2 and perhaps 3 seem the most attractive. The pointer comparison (and equality?) behavior perhaps favors 3, but it only works if we can represent the block and the subIndex in a single pointer. It would be fast to convert 2 into a pointer (or a pointer sized int); again that only works if we can combine them. It's also worth saying that using 3, any access will require some extra ops to extract the block and the subIndex, which isn't necessary with 2.

On the other hand with 2, the pointer cannot be held in a single register, and the subIndex field only needs very few bits, so it seems somewhat wasteful.

Slice

A slice specifies a range of items, again we have a few choices.

  1. Start and end single pointers (however represented)
  2. A start block with a start and end index
  3. Start element block, and sub index and a length

For a slice we require that the start points to an element in a block, but the end can point to the start of the next block.

As well as being able to iterate over items, it is typically important to be able to quickly get a count. We can calculate the count trivially for 2 and 3. For 1, if the block sizes are not a power of 2, it could be considerably more costly. Something like

// Block of some amount of elements in SIMD form
struct Block
{
    enum { ElementCount = ... };
};

Count getCount1(Slice slice)
{
    auto startBlock = slice.start.block;
    auto endBlock = slice.end.block;
    if (startBlock == endBlock)
    {
        return slice.end.subIndex - slice.start.subIndex;
    }
    // The block pointer difference probably requires a divide...
    return (endBlock - startBlock) * Block::ElementCount -
        slice.start.subIndex + slice.end.subIndex;
}

Count getCount2(Slice slice)
{
    return slice.end - slice.start;
}

Count getCount3(Slice slice) { return slice.count; }

Other operations might be

With 2, this is trivial. With 3 it depends, but it will require some work to calculate the block and subIndex; this seems reasonable enough, and would be required for any access anyway.

With [], 2 is trivial. 3 is close to the same, except we add the start subIndex and do the lookup as if the block pointer is the start.

A nice-ish property of 3 is that it is in effect the same as the single pointer option plus a length, so the two can trivially be converted between.

So probably 3 is preferable.

Pointer arithmetic

The most common pointer arithmetic operations are comparisons, ++, --, +=, -=

Let's assume the pointer that can be used for arithmetic is the same as a single pointer (in terms of representation - we may want a different syntax).

If we assume 2, then ++, += or --, -= require checking, and possibly changing, the block pointer. If the amount of elements held in a block is a power of 2, this is somewhat easy.

Ptr::operator-=(in out Ptr ptr, Index delta)
{
    Index index = ptr.subIndex - delta;
    ptr.block += (index >> shift);      // arithmetic shift handles the borrow
    ptr.subIndex = (index & mask);
}

I suppose there is an issue around subIndex in this encoding only needing a very small amount of bits.
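For what it's worth, a concrete C++ version of the decrement can be tested directly. kShift/kMask and the Block contents are assumptions; note the intermediate index must be a signed type so the arithmetic right shift carries the borrow into the block pointer.

```cpp
#include <cassert>
#include <cstdint>

struct Block { float x[4]; };   // 4 elements per block

struct Ptr
{
    Block*  block;
    int32_t subIndex;
};

constexpr int32_t kShift = 2;               // log2(elements per block)
constexpr int32_t kMask  = (1 << kShift) - 1;

inline void sub(Ptr& ptr, int32_t delta)
{
    int32_t index = ptr.subIndex - delta;   // may go negative
    ptr.block    += (index >> kShift);      // arithmetic shift: borrow propagates
    ptr.subIndex  = (index & kMask);
}
```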

Bounds checking

(TODO)

GPUs and CPUs

The discussion here so far has largely been around SIMD and CPUs. GPUs have the convenient property that many of these issues are handled invisibly via the languages used and in hardware. So what does this look like from a programming language model point of view? What would equivalent code generated for a GPU look like?

(TODO)

Other ideas

Combinatorial explosion of Scenarios

One aspect of this discussion is that there are many ways to get some code base to work effectively, but how to do that changes depending on the hardware of the target.

There are several ways this might be approached

  1. Have a language that represents the low level behavior and allow building of abstractions
  2. Have a compiler that understands the different targets
  3. Have mechanisms to "markup" declarations that say what is desired for a target

All are to a degree applicable.

C++

C++ is most like 1, and it is possible to create some of the kinds of transformations described here through template metaprogramming. It is also possible to use the preprocessor, or even switch out different implementations depending on the target.

A C++ compiler is designed to compile efficient code for a particular target. There are switches to control optimizations and assumptions. This works at the level of the C++ abstract machine, which doesn't closely fit the kinds of hardware that are available - it doesn't closely fit SIMD or GPUs, for example.

It is possible to provide some markup around conditionals, or to say whether inlining of a function is desired. The problem is the programmer often has to guess as to whether such markup is suitable. On some targets inlining might be best; on others not inlining is best. So how do you mark up your function?

Describing a transformation in code

Many of the situations I have been describing here are low level, but not lower level than (say) C/C++. It's as if there is this level above the low level where lots of interesting things could be done, but there is no way to describe them other than to write them out explicitly, making them target centric.

At this level much of the abstraction of the machine is defined. Templates work at this level, but templates are not a great way to program. They are not powerful enough, and require clunky tricks to do often simple things.

What would be ideal would be some way of describing in some form of the language these kinds of transformations. Then provide other mechanisms to describe how to apply them.

It would be a goal that in order to produce good code for a new target it wouldn't be necessary to alter the compiler in any way, only to alter these descriptions and how they are applied.

The big part that is missing here is how to make such descriptions and where and how are they defined.

Separating Target Logic

On 3, there is some friction around having markup at the usage sites. For example, imagine there are 10 targets that all require different tweaks to perform well with a type. Do I really want to define all these tweaks with the type? That is going to get in the way of the type definition, which most of the time is what you care about. One could imagine a few improvements

Simulating SIMT on SIMD

(TODO)

Links