Today I announce the world's fastest markdown parser - meep (*)!
NOTE! It might not (probably not?) be the world's fastest - I only tested against cmark ;)
I've been using markdown for many years, and it seems pretty straightforward. But just like my previous discussion around the meep C preprocessor, it turned out to be a rather humbling experience.
I didn't need a complete implementation - only something that could be used as part of documentation support. On looking around for a markdown spec I came across commonmark. In the end, though, I did do a fairly complete commonmark implementation.
The spec is surprisingly complex. In my naivety I thought I'd just use the spec as a reference, to check that my implementation was broadly consistent with it. I did write an original implementation that covered most of the commonmark spec, but as I dug deeper into behavior - especially around links and images - it became increasingly difficult to produce matching results.
It was at this point that I looked at other implementations, found them surprisingly complex, and started reading the bottom section of the commonmark spec, which very helpfully explains the basis for an algorithm. I didn't follow that algorithm exactly, but used it as a starting point. This probably undersells how much rewriting, and throwing away of code, I did.
The C preprocessor is a "language" that came out of many years of changes, implementations and improvements. On a cursory level it seems like it would be possible to implement it in many different ways, but in practice, if you want it to behave the same way as production compilers, you likely end up with something very similar to the canonical implementations. The same, in the end, happened with my commonmark implementation: if I wanted to follow the commonmark spec, I ultimately had to largely follow their parsing strategy.
I'm thankful to all those who wrote the commonmark spec. They unified the markdown concept. The spec isn't always easy to understand, but by making it so easy to test things out directly on the webpage, and by providing numerous examples, it was instrumental in this effort.
My implementation produces a `Document` type which provides a superset of rich text functionality, and it supports entities (such as `&somename;`) - there are a lot of them.

A summary of reasons why the commonmark parsing strategy needed to be used follows. I had some self-imposed restrictions in my implementation:

* Use the `LineParser` type, which simplifies working with line segments from the original source; it works in terms of the source `char*` but requires no extra allocation or mapping

My initial versions tried to parse recursively. I was able to make this work to a degree, because I allowed backtracking on the construction of the `Document`. Whilst that worked, it always seemed like a bit of a hack, not least because it required a "dummy" diagnostic collector to ignore errors/warnings in order to be able to backtrack.
In the end the algorithm comes in several parts:

* `LineParser` - a type that allows disjoint lines (or parts of lines) to be processed as contiguous text or lines as needed
* `MarkdownParser` - the main parser; it coordinates with the `MarkdownLineParser` and the `MarkdownTextParser`, taking the `MarkdownLineParser`'s parse tree and converting that into suitable `Document` structures
* `MarkdownLineParser` - responsible for converting the source file into a tree (aka the block structure)
* `MarkdownTextParser` - responsible for turning text spans into `Document` contents (aka the inline structure); it uses the `LineParser` to take spans of text and treat them as contiguous text
* `HtmlDocWriter` - takes a `Document` and converts it to HTML output

I follow the block structure part of the algorithm fairly closely, although my representation is not pointer based but index based: I hold all nodes in an array.
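As a sketch of what index-based node storage can look like, each node refers to its relations by array index rather than by pointer. The names below (`BlockNode`, `BlockTree`, `kInvalid`) are illustrative guesses, not meep's actual types:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: block-tree nodes held in one flat array,
// linked by indices rather than pointers. kInvalid marks "no node".
constexpr uint32_t kInvalid = 0xffffffff;

struct BlockNode {
    uint32_t parent = kInvalid;
    uint32_t firstChild = kInvalid;
    uint32_t lastChild = kInvalid;
    uint32_t nextSibling = kInvalid;
};

struct BlockTree {
    std::vector<BlockNode> nodes;

    // Append a new node under `parent`, returning its index.
    uint32_t addChild(uint32_t parent) {
        uint32_t index = uint32_t(nodes.size());
        nodes.push_back(BlockNode{});
        nodes[index].parent = parent;
        if (parent != kInvalid) {
            BlockNode& p = nodes[parent];
            if (p.lastChild == kInvalid)
                p.firstChild = index;
            else
                nodes[p.lastChild].nextSibling = index;
            p.lastChild = index;
        }
        return index;
    }
};
```

One nice property of this representation is that indices stay valid when the backing vector reallocates, which raw pointers would not.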
For the "inline" text part of the algorithm, I follow the basic mechanisms that are used for link and image parsing. My stack is just a dynamic array though, and I don't have a doubly linked list. I have two passes here. The first pass produces a "delimiter list": it scans forwards through the text, adding `Delimiter` structures for emphasis, potential images/links, autolinks and so on. It has a delimiter stack, and I use this when I hit an image/link closing to determine what to do, in a similar way to the commonmark algorithm. When emphasis is seen, I add flags marking how the marks should most likely be interpreted - as opens, closes, or both. I also have flags to indicate if a `Delimiter` is used or deactivated. If it is deactivated, it can be ignored/removed. If it is not used, we haven't resolved how to interpret it yet.
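A hypothetical sketch of such a delimiter record follows; the names, flags, and layout are my own guesses for illustration, not meep's actual definitions:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative delimiter record built during the first (scanning) pass.
struct Delimiter {
    enum Flags : uint8_t {
        kCanOpen     = 1 << 0,  // could start emphasis
        kCanClose    = 1 << 1,  // could end emphasis
        kUsed        = 1 << 2,  // resolution decided how to interpret it
        kDeactivated = 1 << 3,  // can be ignored/removed
    };

    uint32_t start = 0;   // span into the original markdown source
    uint32_t length = 0;
    uint8_t flags = 0;

    // A delimiter is resolved once it is either used or deactivated;
    // otherwise its interpretation is still undecided.
    bool isResolved() const { return (flags & (kUsed | kDeactivated)) != 0; }
};
```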
Then I make a pass over the delimiters in a forward direction. At this point all of the links/images have been classified. Going forward has the advantage of producing more natural/intuitive results for the other kinds of inline markers. In this pass, how all `Delimiter`s are to be treated is resolved. This pass also ensures that we consistently open and close `Delimiter`s such that they are balanced. Afterwards, `Delimiter`s are either used or deactivated. Finally we strip out anything that isn't used.
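The balancing idea in that forward pass can be sketched with a plain stack over a dynamic array. This is a simplified illustration of the principle (pair each closer with the nearest unmatched opener, leave everything else unused), not meep's actual code:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified classification of a delimiter's possible roles.
enum class Kind { Open, Close, Both };

// Walk delimiters left to right, pairing each closer with the nearest
// unmatched opener. Returns, per delimiter, whether it ended up used;
// because pairing is strictly nested, the used set is always balanced.
std::vector<bool> resolve(const std::vector<Kind>& delims) {
    std::vector<bool> used(delims.size(), false);
    std::vector<size_t> stack;  // indices of unmatched openers
    for (size_t i = 0; i < delims.size(); ++i) {
        Kind k = delims[i];
        if ((k == Kind::Close || k == Kind::Both) && !stack.empty()) {
            used[stack.back()] = true;  // matched opener
            used[i] = true;             // matched closer
            stack.pop_back();
        } else if (k == Kind::Open || k == Kind::Both) {
            stack.push_back(i);
        }
    }
    return used;  // anything still false is stripped later
}
```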
`Delimiter`s record the span of the original markdown that they are associated with. So now, to finally convert, we in effect:

* apply entity (`&`) and markdown `\` escaping to the output
* apply `Style` to the document, emitting `Style` opens and closes (for example with `<br/>`)

Once we have completed constructing the document, we just need to convert it to output; currently that means using `HtmlDocWriter` to write it out.
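Writing HTML also means escaping HTML-significant characters in text runs. A minimal sketch of such an escape routine, as one might expect an HTML writer to do it (illustrative only, not `HtmlDocWriter`'s actual code):

```cpp
#include <cassert>
#include <string>

// Replace characters that are significant in HTML with entities,
// copying everything else through unchanged.
std::string escapeHtml(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        switch (c) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            default:   out += c;        break;
        }
    }
    return out;
}
```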
I only performed very preliminary testing against cmark. To do so I created a 250kb markdown file by concatenating a smallish file to itself 64 times. I then ran:

```
$ time cmark -t html test-64.md > /dev/null
$ time meep -to html test-64.md > /dev/null
```
Real time is around 0.038s for cmark and around 0.035s for meep. So not a huge win - but I haven't performed any specific optimization on my implementation, or profiled it to work out where the cycles are going, so I'm happy it's competitive. Note that meep may do slightly better than this: I needed to use a suitably large markdown file to overcome the fact that the meep binary is significantly larger than cmark's, and so takes longer to load. If I factor that out, meep may be more around 0.030s (so perhaps toward 27% faster).
Meep and its supporting infrastructure are written in C++ and were compiled for these tests on Linux Mint 21.1 with gcc 11.4.0. The tests were performed on a 12th Gen Intel i7-12700. The test file was approximately 2.5Mb of markdown.