Messing with Non-Cryptographic Hashes

Posted on: 2015-04-11

As a side issue I've been doing some work on hashing. These results were produced on Linux on 1007U/x64 CPU...

                              Time   Hits  Hash
U Scatter (E)                17765     95  e3bf408c
U OneAtATime (E)             24078     91  c89a1410
A RabinKarp                   5840     91  8a2bee68
U RabinKarp                   5867     96  f2e8958c
A Bernstein                   6438     96  764b8998
U Bernstein                   6220     91  ed3197fc
A Hseih                       9464     91  f62f7790
U Hseih                       9954    101  4a7dffd4
A Fnv1aMeiyan                 4956    101  63b87c78
U Fnv1aMeiyan                 6105     97  1ff4fd14
A CrapWow                     6243     97  dea09e44
U CrapWow                     6705     94  15e59b0c
A Default                     6589     94  8a2bee68
U Default                     7178     96  f2e8958c

The A/U is if it is an aligned or unaligned implementation. The A means 32 bit alignment, U is unaligned. On x86 cpus you can do unaligned reads, and they are actually surprisingly fast. On other CPUs that would cause a fault, and there is another code path which does the alignment in software.

The (E) means that the algorithm is endian independent - generally meaning it reads data in byte order. As seen from the results it's typically 1/4 of the speed. That's not very surprising when you consider most other hashes are working with 32 bit reads, so can do 4 times as much work per iteration.

The results for 64 bit hashes...

U Scatter                    17303     96  e520d643ba36f0d0
A RabinKarp                   6358     96  55a767b88a2bee68
U RabinKarp                   6399     96  17481ff7f2e8958c
A Bernstein                   6195     96  70b01560764b8998
U Bernstein                   5051     91  ab135279ed3197fc
A Default                     6740     91  55a767b88a2bee68
U Default                     7120     96  17481ff7f2e8958c

The hits column is a fairly crude measure of how many times a bucket is hit using the hash. The hash is most often used with set/maps. In my implementation these are generally power of 2 sized. Approximately take the closest larger power of 2 and double it. The hits value measures how many times a power of 2 bucket was hit. In terms of hash usage, lower is generally better.

I liked this hash function article, and was interested in improving endian independent hashing - something that will be useful for serialization. I tired OneAtATime as it recommended but the performance is terrible. The scatter algorithm is very simple - it just reads a byte does a random table lookup and mixes by rotating and xoring. It's performance was better but still pretty slow.

So I decided to see if I could use RabinKarp. I already had code which allows it to work with unaligned data. As RabinKarp works by reading int32s there needs to be byte swap on one of the endian types - I chose big endian as it's more unusual.

In terms of more general usage my actual raw data may not be in 'byte order' in memory - an array of int16s for example will already have each int16 byte swapped for the endianess of the machine.You could do the endian flip for the uint16, and then endian flip the int32 you read, but you can do it more efficiently by just doing the int32 read, and doing short swap on that. With int32 data, no swap is needed (it's already in the right order), and on int64 - you need to do a int32 swap on the int64.

It took a while to get this all to work. I also wrote code to shift around, and resize chunks of data to hash, as well as some code to reorganise memory such that a block of memory when read would appear as it would on a flipped endian machine. This was important because I don't have an easy setup for testing big endian code. I could pull out the PowerPC mac mini - but that's going to be painful :)

In conclusion I'm pleased with the reorganisation, and it's good to have code that can give good analytic results to decide which code path is best to use. That said retrospectively the work on improving endian independent hashing was probably not really worth it. It's a fairly rare code path - just something about it got in my craw. It seemed simple enough, but there are lots of annoying details when you mix arbitrary endianess/alignment/length. Hey ho.