Lexing

Posted on: 2023-06-10

The following discussions are around lexing improvement ideas.

Conclusion

For next steps probably...

Store Tokens uncompressed initially
- We can probably compress quite well although perhaps not as much as before
Use local LexemeMaps for source files if source type is C preprocessable
Have a LexemeMap associated with a TokenStream
Have prototype LexemeMaps for each language. Each language has specialized lexemeIndices for keywords
For simplicity we can have a bit to store if LexemeIndex is used (so we can use for preprocessor paste)
Have a union for LexemeIndex or Length
Need to take into account hierarchy of languages

Can still consider having paste buffers stored within SourceLoc/SourceManager. This is preferable for a variety of reasons but is more complicated to implement.

Discussion

The lexing system has a significant amount of complexity because it supports the C preprocessor.

Local LexemeIndex

In the current system there is a LexemeMap. A Token looks something like

struct Token
{
    Type type;
    SourceLoc loc;
    LexemeIndex lexemeIndex;
};

The LexemeIndex can be resolved into the actual string contents, via the LexemeMap. As part of the system the FileSystem can store on a SourceFile a token stream. That token stream only makes sense via the LexemeMap. Since the SourceManager is effectively global (or at least potentially shared between compilations or just source files) so is the LexemeMap. Another nuance of the arrangement is that the LexemeMap is pre-filled in with keywords and other common names. Since the system supports multiple input languages and what isn't or is a keyword is target specific we have another table that can map the fixed LexemeIndices to a Keyword index. This extra lookup is just an array lookup, but it's still somewhat irritating.

The main observations here, are that

We can associate a LexemeMap with a SourceFile, and make the LexemeIndex local to the source file
We can always quickly merge a local LexemeMap into some composite (required for say doing preprocessor combination)
If the language doesn't include the c preprocessing we don't need to store the Tokens at all, or do any remapping (Tokens with local LexemeIndices is fine)
We don't need to use the second level keyword mapping we can tokenize for a specific language

On 2, lets say we are producing a TokenStream through the preprocessor. This combines tokens from many source files, which will now have their own local LexemeMap. This is a very fast process because

The LexemeIndices are all deduped
Lexeme indices are just indices (and they will be dense, as only local to the file)

So all that is needed is to iterate through the local lexeme maps lexemes - this is only needed for lexeme indices whos mapping changes (ie not global ones, and not language ones if the target map is in the same language). We add all the lexemes to the target lexeme map, and store the target indices. Doing so we produce a map in an array from the local indices to the target indices. It is now trivial to remap the Tokens lexeme indices.

On 3, if a language doesn't support the C++, we don't need to cache any of this information because presumably the the tokenization is only needed once to produce IR.

On 4, because the lexeme map is local to the source file, the lexeme indices can be tailored to the source type. We can make all of the LexemeIndices for the language an enum. It is trivial to test if a LexemeIndex is for a keyword, it only has to be in a specific range (specified by the language enum maximum value). It may still be necessary for some targets to have some other data to determine how/where a keyword can be used. I'm using keyword here rather loosely as we could have some identifiers have special meaning in certain contexts. The old mechanism could have multiple maps, but we would need a side channel to determine on context.

There could be an issue around source file usage - as here I'm assuming a source file is only used for one language. That's not hard to work around though, because if a file is used in a different way to the cached information we can always quickly remap it to another language, using the mechanisms described on 2.

On 1, one advantage of this approach is that we can tokenize in parallel, because there isn't a shared LexemeMap across source files.

We will need to consider TokenStream as having it's own LexemeMap.

For each language we will want something like a LexemeMap prototype, which is pre-initialized with mappings specific for that language. Prior to a lexing, we copy the prototype, and in doing so we know the special keyword ids are already setup.

Using LexemeIndex for Identifiers

In the current system all Token hold a LexemeIndex, and since they do

We don't need a separate paste buffer (we can just make a new LexemeIndex)
We don't need to store a lexeme length - it's held in the LexemeIndex

On the other hand it means

Every lexeme is stored in the LexemeMap
- Even if the contents could never be shared
- Even if the contents is huge
Sometimes deduping isn't really needed or appropriate (say with with a doc comment)

Doing a lookup from a SourceLoc to the source file is somewhat slow, because it requires in effect searching through all SourceFile for which one the SourceLoc is in range of. This is necessary for a "cold" lookup, but in typical source one lexeme comes after another. We could use this to make lexeme lookup fast. Something like

struct TokenStream
{
    Slice getLexemeFromSourceLoc(SourceLoc loc, Index length)
    {
        if (loc >= m_startLoc && loc <= m_endLoc)
        {
            return Slice(m_sourceContents + (loc - m_startLoc), length);
        }
        // The slow path.. looks up the associated SourceFile etc. 
        // It's assumed here it finds it.
        _findSourceFile(loc);
        return Slice(m_sourceContents + (loc - m_startLoc), length);
    }
    Slice getLexeme(const Token& token)
    {
        if (token.hasLexemeIndex())
        {
            return getLexemeFromLexemeMap(...);
        }
        else
        {
            return getLexemeFromSourceLoc(token.loc, token.lexemeLength);
        }
    }

    const char* m_sourceContents;       ///< The contents of the source file
    SourceFile* m_sourceFile;           ///< The current source file
    SourceLoc m_startLoc;               ///< The start loc for the file
    SourceLoc m_endLoc;                 ///< The end loc for the file
};

In such an arrangement you might have a Token stored as

struct Token
{
    Type type;
    SourceLoc loc;
    U32 lexemeLength;
    LexemeIndex lexemeIndex;    ///< Only used for `Type` that need deduping
}

Which means you could store a Token as

struct Token
{
    Type type;
    SourceLoc loc;
    union 
    {
        U32 lexemeLength;
        LexemeIndex lexemeIndex;            
    };
}

If we have a lexemeIndex we don't need to store a length as the LexemeMap stores the size along with the contents.

Part of the idea about having the Token not store the contents more explicitly, is that for many token types when parsing there is no need to look at the tokens text content, the Token::Type is enough. For the ones that do the slight extra cost of finding the contents is believed to be outweighed by the actual processing.

One of the ideas about using LexemeMap to store, for example, literals, is the assumption that we could use as a kind of side channel to speed up parsing. The side channel would map from the lexemeIndex to a value for example. If the Token::Type is IntLiteral, we can store the integer value. We don't need to convert it into an int if there are repeated values. This only makes sense if it is faster to do a lookup than it is to do the conversion. That is probably true for some types (say float conversion), but not others (say a short int).

Hierarchy of Languages

Lets say we make LexemeIndices use well defined ids for specific languages as discussed elsewhere. As part of that idea we use well defined values for specific LexemeIndices, probably represented via an enum. We could see these enums and sets of lexeme indices as a kind of hierarchy.

For example

Base
Preprocessor
C
C++

Where the base (probably) holds no LexemeIndices. The Preprocessor holds the preprocessor keywords. C holds Base/Preprocessor and C keywords. C++ holds them all.

If we are careful we can make the indices compatible subsets of one another. This additionally means that any lower level language use can use the LexemeIndices as is. A higher level will require require remapping, because what is a keyword in the "derived" language might not be so in the super language/s.

This idea also extends to allow the preprocessor to work with non C/C++ based languages. For example a language which supports preprocessor, but isn't C/C++ can just derive ids from pre-processors.

We can use ranges to indicate different categories of keywords. This idea could also be extended to other kinds of subsets, and would be faster than doing a table lookup.

Paste Buffer

There are bits available in the Token that could control if the lexemeIndex is being used or not. If we just had one type that used the lexemeIndex such as identifier, then we wouldn't need the bit, we could just have a test for Type == Identifier. The reason to want to make the distinction separate is that the C preprocessor can paste tokens together to produce a new token, and the pasted token doesn't have to be an identifier.

There are other ways of dealing with this - for example we could create paste buffer locs. Then we could keep the same logic. The problem is that implementing the paste buffer as locs is somewhat more complicated, because it implies

Having a SourceFile that can contain multiple pastes
- Ideally we don't want to allocate a File for every paste!
Making sure the SourceFile is large enough
- If a very large paste is produced, we might want to allocate a SourceFile for all of that
Ideally we'd have data/logic where it was possible to determine how the paste was produced
- If we did we could output a nice diagnostic showing a tokens origin

On the other hand doing it via lexemeIndex, is trivial as the LexemeMap manages storage. It additionally dedups storage. We don't need to create paste SourceFiles and manage them. The disadvantage is there is no way to track how a pasted token was produced.

Token Storage

In the SourceFile we allow token streams to be stored in lots of different ways - in particular we allow Token compression. The idea here was to store with source token streams so it wasn't necessary to retokenize when a file is reused. This is taken further because for files that might contain c preprocessor constructs the token stream is broken down into preprocessor AST with streams of tokens. This makes performing preprocessor expansion simpler and also probably considerably faster.

Part of the idea of storing compressed TokenStreams was to reduce memory storage. Part of that idea was to improve performance because being a smaller representation would improve cache usage. It's not clear that any of this makes a significant performance difference in practice though. The mechanisms for TokenStream allow for switching out implementations, so I did some very simple tests around that and didn't see much of a bump.

If we can keep a Token to 12 bytes it might be reasonable to leave uncompressed at least for now.

Compressed Stream

You could imagine something like

struct CompressedToken
{
    U8 control;
    Type type;
    // Followed by bytes determined by control
};

A Cursor would need to store

Index to current position in the byte stream
Current SourceLoc - holds position to start for current lexeme
Previous LexemeIndex - applying the delta takes u

Another way to store this would be for white space delta is stored in the current Token to take you from the end of the previous lexemes SourceLoc to the current. In that case it would be

Index to current position in the byte stream
Previous SourceLoc to end of last lexeme
Previous LexemeIndex - applying the delta takes u

Therefore a compressed token needs to store

Type
Either a (potentially small) whitespace delta SourceLoc or a full delta
Either a length or lexemeDelta

struct Control
{
    U4 whiteSpaceDelta;             ///< Only needed if `isLargeWhitespaceDelta` is false
    U2 lengthLexemeByteCount;       ///< How many bytes store the lexemeIndex delta or the length   
    Bool isLexemeIndex;             ///< If set has a lexeme index following, else it's a length
    Bool isLargeWhitespaceDelta;    ///
};

We could do a switch on the bottom two bits, for 4 basic scenarios.

Alternatively perhaps we could store short lexeme lengths in Control, at the cost of size of whiteSpace delta. So something like

struct Control
{
    U2 lexemeLength;                ///< Set if the lexeme length is small (and it's not a lexemeIndex maybe?)
    U2 whiteSpaceDelta;             ///< Only needed if `isLargeWhitespaceDelta` is false
    U2 lengthLexemeByteCount;       ///< How many bytes store the lexemeIndex delta or the length   
    Bool isLexemeIndex;             ///< If set has a lexeme index following, else it's a length
    Bool isLargeWhitespaceDelta;    ///
};

Yet another alternative would be to store an enum for the common cases, that we can switch on common scenarios and others are encoded with less in Control.

Another thought might be to just use some standard compression mechanism such as lz4, and store the stream compressed. Then expand when needed. The advantage would be greater compression, and not requiring a clever mechanism to stream. Doing so could mean that the TokenStream doesn't have to be an abstraction, but is literally a stream of Tokens.

For initial implementation just making it non compressed using streaming mechanism that is already there is probably reasonable.