Today I announce the world's fastest markdown parser - meep (*)!
NOTE! It might not (probably not?) be the world's fastest - I only tested against cmark ;)
I've been using markdown for many years, and it seems pretty straightforward. But just like my previous discussion around the meep C preprocessor, it turned out to be a rather humbling experience.
I didn't need a complete implementation - only something that could be used as part of documentation support. On looking around for a markdown spec I came across commonmark. In the end, though, I did do a fairly complete commonmark implementation.
The spec is surprisingly complex. In my naivety I thought I'd just use the spec as a reference, to check that my implementation was broadly consistent with it. I did write an original implementation that covered most of the commonmark spec, but as I dug deeper into behavior - especially around links and images - it became increasingly difficult to produce matching results.
It was at this point that I looked at other implementations, found them surprisingly complex, and started reading the bottom section of the commonmark spec, which very helpfully explains the basis for an algorithm. I didn't follow that algorithm exactly, but used it as a starting point. This probably undersells how much rewriting, and throwing away of code, I did.
The C preprocessor is a "language" that came out of many years of changes, implementations and improvements. On a cursory level it seems like it would be possible to implement it in many different ways, but in practice, if you want it to behave the same way as production compilers, you likely end up with something very similar to the canonical implementations. The same, in the end, happened with my commonmark implementation: if I wanted to follow the commonmark spec, I ultimately had to largely follow their parsing strategy.
I'm thankful to all those who wrote the commonmark spec. They unified the markdown concept. The spec isn't always easy to understand, but by making it so easy to test things out directly on the webpage, and by providing numerous examples, it was instrumental in this effort.
My implementation produces a `Document` type which provides a superset of rich text functionality, and it supports entities (such as `&somename;`) - there are a lot of them.

A summary of reasons why the commonmark parsing strategy needed to be used follows. I had some self-imposed restrictions in my implementation:

* Use the `LineParser` type, which simplifies working with line segments from the original source; it works in terms of the source `char*` but requires no extra allocation or mapping

My initial versions tried to parse recursively. I was able to make this work to a degree, because I allowed backtracking on the construction of the `Document`. Whilst that worked, it always seemed like a bit of a hack, not least because it required a "dummy" diagnostic collector to ignore errors/warnings in order to be able to backtrack.
In the end the algorithm comes in several parts:

* `LineParser` - a type that allows disjoint lines (or parts of lines) to be processed as contiguous text or lines as needed
* `MarkdownParser` - the main parser; it coordinates with the `MarkdownLineParser` and the `MarkdownTextParser`, taking the `MarkdownLineParser`'s parse tree and converting that into suitable `Document` structures
* `MarkdownLineParser` - responsible for converting the source file into a tree (aka the block structure)
* `MarkdownTextParser` - responsible for turning text spans into `Document` contents (aka the inline structure); it uses the `LineParser` to take spans of text and treat them as contiguous text
* `HtmlDocWriter` - takes a `Document` and converts it to HTML output

I follow the block structure part of the algorithm fairly closely, although my representation is not pointer based but index based: I hold all nodes in an array.
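As a sketch of what index-based node storage can look like, each node refers to its relations by array index rather than by pointer. The names below (`BlockNode`, `BlockTree`, `kInvalid`) are illustrative guesses, not meep's actual types:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: block-tree nodes held in one flat array,
// linked by indices rather than pointers. kInvalid marks "no node".
constexpr uint32_t kInvalid = 0xffffffff;

struct BlockNode {
    uint32_t parent = kInvalid;
    uint32_t firstChild = kInvalid;
    uint32_t lastChild = kInvalid;
    uint32_t nextSibling = kInvalid;
};

struct BlockTree {
    std::vector<BlockNode> nodes;

    // Append a new node under `parent`, returning its index.
    uint32_t addChild(uint32_t parent) {
        uint32_t index = uint32_t(nodes.size());
        nodes.push_back(BlockNode{});
        nodes[index].parent = parent;
        if (parent != kInvalid) {
            BlockNode& p = nodes[parent];
            if (p.lastChild == kInvalid)
                p.firstChild = index;
            else
                nodes[p.lastChild].nextSibling = index;
            p.lastChild = index;
        }
        return index;
    }
};
```

One nice property of this representation is that indices stay valid when the backing vector reallocates, which raw pointers would not.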
For the "inline" text part of the algorithm, I follow the basic mechanisms that are used for link and image parsing. My stack is just a dynamic array though, and I don't have a doubly linked list. I have two passes here. The first pass produces a "delimiter list": it scans forwards through the text, adding `Delimiter` structures for emphasis, potential images/links, autolinks and so on. It has a delimiter stack, and I use this when I hit an image/link closing to determine what to do, in a similar way to the commonmark algorithm. When emphasis is seen, I add flags marking how the marks should most likely be interpreted - as opens, closes, or both. I also have flags to indicate if a `Delimiter` is used or deactivated. If it is deactivated, it can be ignored/removed. If it is not used, we haven't resolved how to interpret it yet.
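A hypothetical sketch of such a delimiter record follows; the names, flags, and layout are my own guesses for illustration, not meep's actual definitions:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative delimiter record built during the first (scanning) pass.
struct Delimiter {
    enum Flags : uint8_t {
        kCanOpen     = 1 << 0,  // could start emphasis
        kCanClose    = 1 << 1,  // could end emphasis
        kUsed        = 1 << 2,  // resolution decided how to interpret it
        kDeactivated = 1 << 3,  // can be ignored/removed
    };

    uint32_t start = 0;   // span into the original markdown source
    uint32_t length = 0;
    uint8_t flags = 0;

    // A delimiter is resolved once it is either used or deactivated;
    // otherwise its interpretation is still undecided.
    bool isResolved() const { return (flags & (kUsed | kDeactivated)) != 0; }
};
```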
Then I make a pass over the delimiters in a forward direction. At this point all of the links/images have been classified. Going forward has the advantage of producing more natural/intuitive results for the other kinds of inline markers. In this pass, how all `Delimiter`s are to be treated is resolved. This pass also ensures that we consistently open and close `Delimiter`s such that they are balanced. Afterwards, `Delimiter`s are either used or deactivated. Finally we strip out anything that isn't used.
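The balancing idea in that forward pass can be sketched with a plain stack over a dynamic array. This is a simplified illustration of the principle (pair each closer with the nearest unmatched opener, leave everything else unused), not meep's actual code:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified classification of a delimiter's possible roles.
enum class Kind { Open, Close, Both };

// Walk delimiters left to right, pairing each closer with the nearest
// unmatched opener. Returns, per delimiter, whether it ended up used;
// because pairing is strictly nested, the used set is always balanced.
std::vector<bool> resolve(const std::vector<Kind>& delims) {
    std::vector<bool> used(delims.size(), false);
    std::vector<size_t> stack;  // indices of unmatched openers
    for (size_t i = 0; i < delims.size(); ++i) {
        Kind k = delims[i];
        if ((k == Kind::Close || k == Kind::Both) && !stack.empty()) {
            used[stack.back()] = true;  // matched opener
            used[i] = true;             // matched closer
            stack.pop_back();
        } else if (k == Kind::Open || k == Kind::Both) {
            stack.push_back(i);
        }
    }
    return used;  // anything still false is stripped later
}
```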
`Delimiter`s record the span of the original markdown that they are associated with. So now, to finally convert, we in effect:

* apply entity (`&`) and markdown `\` escaping to the output
* apply `Style` to the document, emitting `Style` opens and closes (for example with `<br/>`)

Once we have completed constructing the document, we just need to convert it to output; currently that means using `HtmlDocWriter` to write it out.
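Writing HTML also means escaping HTML-significant characters in text runs. A minimal sketch of such an escape routine, as one might expect an HTML writer to do it (illustrative only, not `HtmlDocWriter`'s actual code):

```cpp
#include <cassert>
#include <string>

// Replace characters that are significant in HTML with entities,
// copying everything else through unchanged.
std::string escapeHtml(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        switch (c) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            default:   out += c;        break;
        }
    }
    return out;
}
```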
I only performed very preliminary testing against cmark. To do so I created a 250kb markdown file by concatenating a smallish file to itself 64 times. I then ran:

```
$ time cmark -t html test-64.md > /dev/null
$ time meep -to html test-64.md > /dev/null
```
Real time is around 0.038s for cmark and around 0.035s for meep. So not a huge win - but I haven't performed any specific optimization on my implementation, or profiled it to work out where the cycles are going, so I'm happy it's competitive. Note that meep may do slightly better than this: I needed to use a suitably large markdown file to overcome the fact that the meep binary is significantly larger than cmark's, and so takes longer to load. If I factor that out, meep may be more around 0.030s (so perhaps toward 27% faster).
Meep and its supporting infrastructure are written in C++ and were compiled for these tests on Linux Mint 21.1 with gcc 11.4.0. The tests were performed on a 12th Gen Intel i7-12700. The test file was approximately 2.5Mb of markdown.