
Fast Extendable C++ Serialization

Posted on: 2015-03-01

Serialization is hard. Let me clarify - robust, fast, extendable, cross-platform, easy-to-use serialization is hard. Serialization is the process of taking run-time objects (i.e. C++ classes and the like) and converting them into a form such that they can be stored, so that later something akin to the original cluster of objects can be reconstructed.

In the past I've used the most noddy style of serialization - basically a form of the IFF format. In essence the IFF format is a binary container. It provides a consistent way of structuring your binary file, and the actual contents are then down to code, typically writing structures directly from memory to the stream. If a structure changes you need to write the fixes directly into the read/write functions. This works, and it is fast and simple - it is what I have done in many old game engines - but it's pretty fragile. You can easily break the file format by changing something small. The read/write functions can get very complicated. More complex 'meta types' (such as containers) need special-case calls/code, and it often ends up being roll-your-own time. The larger your code base becomes, the more unmaintainable and complicated this gets. That's not even talking about the mundanity of writing the read/write functions in the first place. Additionally, the binary file is effectively impenetrable unless you have the codebase that understands its hard-coded nuances. Even if you have a code base that understands it now - give it a year and it may not.
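To make that fragility concrete, here's a minimal sketch of the kind of hand-rolled chunk read/write this approach produces (the PlayerState struct and the 'PLYR' tag are made up for illustration). Any change to the struct has to be mirrored by hand in both functions, and the file is meaningless without them:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical run-time structure - any change here must be mirrored
// by hand in the read/write functions below.
struct PlayerState
{
    float    position[3];
    int32_t  health;
    uint32_t flags;
};

// Write the struct bytes straight into the stream under an IFF-style chunk id.
bool writePlayerState(std::FILE* file, const PlayerState& state)
{
    const uint32_t chunkId   = 0x504C5952;            // 'PLYR'
    const uint32_t chunkSize = sizeof(PlayerState);
    return std::fwrite(&chunkId,   sizeof(chunkId),   1, file) == 1 &&
           std::fwrite(&chunkSize, sizeof(chunkSize), 1, file) == 1 &&
           std::fwrite(&state,     sizeof(state),     1, file) == 1;
}

// Read it back - this only works if the struct layout, padding and byte
// order are exactly what they were when the file was written.
bool readPlayerState(std::FILE* file, PlayerState& state)
{
    uint32_t chunkId = 0, chunkSize = 0;
    if (std::fread(&chunkId,   sizeof(chunkId),   1, file) != 1 ||
        std::fread(&chunkSize, sizeof(chunkSize), 1, file) != 1)
        return false;
    if (chunkId != 0x504C5952 || chunkSize != sizeof(PlayerState))
        return false;                                  // format has drifted
    return std::fread(&state, sizeof(state), 1, file) == 1;
}
```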

A while back I built a far more sophisticated serialization system using the Java binary format as its basis. The idea behind using the Java serialization format was that it is sort of a far richer version of IFF. It was well documented and you could test against it. It had the interesting, although admittedly never used, property that you could serialize out C++ classes and read them back in Java. In this system you generally didn't write serialization functions - the serialization system used reflection and did the majority of the work for you. Very occasionally you would have to write special-case code for a particularly unusual transformation. It was a system that was easy to use and extendable(ish). I was also able to write code that would optionally read and write XML versions of the same file format. Whilst I was fairly happy with the system, I now consider it obsolete - why?

On the plus side, it exists and it works, and has many of the features that I would like from a serialization system. So I need pretty compelling reasons to replace it.

One of the reasons I had to break Java compatibility was value types. In Java the built-in types (byte, int etc.) are value types, but everything else is a reference type. In C++ code you have a lot of non-built-in value types: Vec4, Transforms, etc. Java couldn't represent these types without a reference type, which added a lot of overhead. For example, say I had a Vec2 (two Float32s) - am I really going to store it in a file format as a pointer to an object that contains two Float32s, reference counts and all? How about an array of 10000 Vec2s? So I added a way for Java to support value types, but in doing so broke compatibility.
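To see why this matters in C++, here's a small sketch (Vec2 here is an illustrative stand-in, not the real engine type): an array of Vec2s is just a contiguous run of Float32 pairs that can be copied or streamed as a single block, with no per-element object header, reference or count in sight.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// A typical C++ value type: two Float32s, nothing else.
struct Vec2
{
    float x;
    float y;
};

static_assert(std::is_trivially_copyable<Vec2>::value, "Vec2 is a plain value type");
static_assert(sizeof(Vec2) == 2 * sizeof(float),       "no hidden header or padding");

// 10000 Vec2s are just 80000 contiguous bytes - one block copy or one stream
// write, rather than 10000 boxed objects each reached through a reference.
std::vector<std::uint8_t> toBytes(const std::vector<Vec2>& points)
{
    std::vector<std::uint8_t> bytes(points.size() * sizeof(Vec2));
    std::memcpy(bytes.data(), points.data(), bytes.size());
    return bytes;
}
```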

Also, having Java compatibility arguably made things significantly more complicated. Java has the concept of a class having 'extendability' - such that it implements Extendable and can write data directly into the stream. This worked at the level of each concrete type in a derivation - meaning you could have data streamed in and out not only at the overall type level, but at each level in the derivation of the type. This made traversing a type to serialize it significantly more complicated.

So what features do I want from a serialization system? Broadly the things I opened with - it should be robust, fast, extendable, cross platform and easy to use, with reflection doing the heavy lifting and with support for more than one representation of the same data (binary and XML at least).

In the old system one assumption was that there shouldn't be an intermediate representation, because that would waste memory - and converting the run-time representation into an intermediate representation seemed like a painful extra step. Endian issues were previously handled by an abstraction at the stream level. That is, I'd have a stream that I could write types to (such as float/int etc.), and the implementation would byte swap if necessary, meaning that the file format always had the same endianness even if the run-time didn't. This was similar to how my inspiration - the Java serialization system - worked.
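For reference, that old stream-level abstraction looked something like the sketch below (the class and method names are hypothetical): every primitive value passes through the stream, which byte swaps it if the host order doesn't match the fixed file order.

```cpp
#include <cstdint>
#include <cstring>
#include <ostream>

// Sketch of the old approach: the stream owns byte ordering, so every
// primitive passes through a swap check even when no swap is needed.
class OrderedOutStream
{
public:
    OrderedOutStream(std::ostream& out, bool hostMatchesFileOrder)
        : m_out(out), m_swap(!hostMatchesFileOrder) {}

    void write(std::uint32_t value)
    {
        if (m_swap)
            value = (value >> 24) | ((value >> 8) & 0x0000ff00u) |
                    ((value << 8) & 0x00ff0000u) | (value << 24);
        m_out.write(reinterpret_cast<const char*>(&value), sizeof(value));
    }

    void write(float value)
    {
        std::uint32_t bits;
        std::memcpy(&bits, &value, sizeof(bits));   // reinterpret as integer bits
        write(bits);
    }

private:
    std::ostream& m_out;
    bool m_swap;
};
```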

For my new system, though, I'm going to break with a lot of those ideas.

First, I will use an in-memory intermediate representation. It will be converted to and from the run-time representation. The serialization code for a specific file type will read from and write to the intermediate representation; it won't have to know anything about the run-time types and data, just the intermediate representation.

Having an intermediate representation makes writing a lot of code much simpler. One of the original concerns against the idea was that it seemed to require two copies of the data in memory - the original and the intermediate representation. In practice it doesn't have to be so, as in many cases the data can just be shared. For example, a raw array of ints in my run-time data is identical in my intermediate representation. It just so happens that arrays of this sort are often where the majority of the data in big files resides. Having an intermediate representation also means that for different file formats I just need code to deal with that representation, not the intricacies of the run-time representations.
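As a rough illustration of the sharing idea - the names and layout here are hypothetical, not the actual system - an intermediate-representation value can simply point at the run-time memory when the bytes are already in the right shape:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of an intermediate-representation value. The real
// system's layout will differ; the point is that plain data can alias the
// run-time memory rather than being copied.
struct IrType;                     // reflected type description (not shown)

struct IrValue
{
    const IrType* type;            // what the bytes mean
    const void*   data;            // may point straight at run-time memory
    std::size_t   sizeInBytes;     // total size of the block
    bool          ownsData;        // false when the run-time array is shared
};

// A raw int array in the run-time object can be exposed to the readers and
// writers without a copy - the layout is already what the file wants.
IrValue shareIntArray(const std::int32_t* ints, std::size_t count, const IrType* intArrayType)
{
    return IrValue{ intArrayType, ints, count * sizeof(std::int32_t), false };
}
```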

Secondly, I will not require that machines with different byte orders produce the same representation. There will be two 'duals' of every unique file - one big endian and the other little endian - and code that can take either and do the endian conversion when the binary file is read. The reflected serialization representation is such that the conversion can be made without writing any byte-ordering code in 'client code'. Simply look at the magic marker at the start of a binary file - if its order is byte swapped, I read in the structure knowing I have to byte swap. For the 'overall structure' in the binary reader I will need to handle byte order by hand, but for the types it describes I can generate code on the fly that will do the order change. The underlying idea here is that I can always convert to the right byte order for a specific platform, and even if the byte order is wrong it can be fixed very quickly at load time. Note that handling endianness no longer happens at the stream abstraction layer - it's at a higher layer than that. I can read large chunks of data raw from the stream, and only have to do any work on them if the endianness is different.
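A minimal sketch of that load path might look like the following (the magic values and function names are made up, and the real per-field fix-up would be driven by the reflected type information rather than hand-written):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical magic marker written at the start of every binary file.
static const std::uint32_t kMagic        = 0x534E4150;   // arbitrary tag
static const std::uint32_t kMagicSwapped = 0x50414E53;   // same tag, bytes reversed

inline std::uint32_t byteSwap32(std::uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

// Read the whole payload raw; only touch the bytes if the writer's byte
// order differs from ours. Which fields are 16/32/64 bit would come from
// the reflected type description, not from hand-written code.
bool loadChunk(std::FILE* file, std::vector<std::uint32_t>& words, bool& needsSwap)
{
    std::uint32_t magic = 0;
    if (std::fread(&magic, sizeof(magic), 1, file) != 1)
        return false;

    if (magic == kMagic)
        needsSwap = false;
    else if (magic == kMagicSwapped)
        needsSwap = true;           // file was written on the other endianness
    else
        return false;               // not one of the two 'duals'

    // ... read the payload raw into 'words' here ...

    if (needsSwap)
        for (std::uint32_t& w : words)
            w = byteSwap32(w);      // load-time fix-up, done once
    return true;
}
```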

Thirdly, I will engage the power of virtual machines. What? Yes, virtual machines. There is a new stage of work that the previous system did not have - the conversion to a standard intermediate format. In truth the previous system did have similar transformations - they just took place within the implementations of the binary/text stream serializers, and in both cases they were very complicated and slow. Here that stage of the transformation has been moved to before the readers/writers act. To do the transformation in general we don't want to have to write code - we want it to be performed via reflection. If the code matched up the representations at run-time it would be slow and would do repeated work. The actual effort includes matching up fields, doing type conversions, shifting data around and calling functions to do custom transformations. That's going to be lots of slow lookups, ifs, switches and calculations - wasted effort, as the same work is repeated again and again for all instances of the same type. So let's not do that.

The majority of the work is as follows:

- matching up source and destination fields
- converting between types
- copying or shifting blocks of data around
- calling functions to perform custom transformations

That's a pretty small set of operations, and it's something you can write a simple virtual machine to do. Moreover, by having such a machine you can optimize the transformation to make it fast for the likely many times it's called on each instance. For example, I could have three fields - a Float32, an Int32 and an enum (which maps to a UInt32 here) - one after another in both representations. The virtual machine's compiler knows all those types collapse to a UInt32 for the purposes of copying, which it can optimize into a single copy of three UInt32s. There are lots of other fairly simple optimizations that can be performed. This may take a bit of work - but it's one-off work. I represented this transformation process with a base class called Remapper. If during profiling you find the virtual machine's transform is too slow for your use case, you can always derive from Remapper and implement it in C++ code for the best possible performance.
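As a rough sketch of the idea - the instruction set and class layout below are assumptions, only the Remapper name comes from the system itself - the 'program' is just a list of operations built once per type pair, and the compile step can merge adjacent same-size copies into one:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical instruction set for the transformation VM. The real ops and
// encoding will differ; the shape of the idea is what matters.
struct RemapOp
{
    enum class Kind { Copy /*, Convert, CallCustom, ... */ };
    Kind        kind;
    std::size_t srcOffset;
    std::size_t dstOffset;
    std::size_t sizeInBytes;
};

// Base class mentioned in the text: derive and hand-code execute() in C++
// if profiling shows the interpreted program is too slow for a given type.
class Remapper
{
public:
    virtual ~Remapper() {}
    virtual void execute(const void* src, void* dst) const = 0;
};

class VmRemapper : public Remapper
{
public:
    explicit VmRemapper(std::vector<RemapOp> ops) : m_ops(std::move(ops)) {}

    void execute(const void* src, void* dst) const override
    {
        const std::uint8_t* s = static_cast<const std::uint8_t*>(src);
        std::uint8_t*       d = static_cast<std::uint8_t*>(dst);
        for (const RemapOp& op : m_ops)
            std::memcpy(d + op.dstOffset, s + op.srcOffset, op.sizeInBytes);
    }

private:
    std::vector<RemapOp> m_ops;
};

// The 'compiler' step: the Float32/Int32/enum example above collapses from
// { {Copy,0,0,4}, {Copy,4,4,4}, {Copy,8,8,4} } into { {Copy,0,0,12} }.
```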

It may not have been obvious above, but one of the major ideas behind the intermediate representation is that it is very close to the binary serialization representation. So close, in fact, that to write out an object in this representation I can just do a straight stream write. A read is nearly as simple - a straight stream read and possibly an endian flip. It doesn't get much faster than that.
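In code that ends up about as minimal as serialization gets - something like this sketch (the function names are illustrative):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Sketch only: 'blob' is an intermediate-representation chunk whose in-memory
// layout already matches the file layout, so writing is one raw stream write.
bool writeBlob(std::FILE* file, const std::vector<std::uint8_t>& blob)
{
    return std::fwrite(blob.data(), 1, blob.size(), file) == blob.size();
}

// Reading is nearly as cheap: one raw read, plus the load-time endian fix-up
// described above if the file came from the other byte order.
bool readBlob(std::FILE* file, std::vector<std::uint8_t>& blob, std::size_t size)
{
    blob.resize(size);
    return std::fread(blob.data(), 1, size, file) == size;
}
```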

In my original idea I was going to make the intermediate representation for the whole system be directly writable. That would be very fast and in some respects very simple! But, whilst it took a while to let go of, it doesn't work for my purposes, because I want to be able to 'patch' the intermediate representation, and it's also too low-level a representation for, say, how an XML reader/writer would work. So instead of the intermediate representation being literally the binary file format, it is truly intermediate - something between the run-time and the serialization structures.

That's an overview for now, I'm sure I'll get into more of it with later posts.