Using Fibonacci Hashing in hashmap.h
I’ve been noodling a lot with my single header C/C++ hashmap recently - hashmap.h. I knew from reading around that the power-of-two size requirement I had meant that it made getting a bucket location in the range was easy, but that it could result in suboptimal bucket selection. Various places on the interwebs said that using a prime number hash was better as doing the modulus by a prime would give a much better avalanche effect, but this would come at the cost of having to do an actual integer modulus operation for every hash - which is quite a bit slower.
Then friend of the blog Nathan Reed made me aware of Fibonacci Hashing, which is utterly wonderful.
Before my hasher looked like:
unsigned get_bucket(const void* const data, const unsigned length) {
unsigned hash = crc32_hasher(data, length);
return hash & (capacity - 1);
}
And now with the fibonacci hash:
unsigned get_bucket(const void* const data, const unsigned length) {
unsigned hash = crc32_hasher(data, length);
return (hash * 2654435769u) >> (32 - integer_log2(capacity));
}
2654435769
is chosen by doing pow(2, 32) / fibonaccis_golden_ratio
and
rounding towards the nearest odd integer (so that the bottom bit is set).
One of the claims of fibonacci hashing is that consecutive inputs will produce a good spread of outputs, meaning you don’t hit problems when multiple things with similar hashes get inserted next to each other. So how does it perform?
The four lines on the chart as as follows:
- avalanche - the percentage of locations that have at least half of their active bits flipped between consecutive inputs. The closer to 100% this is the better a hashing approach is regarded to be.
- average difference - how different two consecutive inputs are to each other by using a delta between them, as a percentage of the range. So an average difference of 50% with a bucket size 8 means that for any two consecutive inputs we’d expect them to have a difference of 4.
- min difference - for two consecutive inputs in the integer range that pass through the fibonacci hash, what is the minimum delta difference between them. So a bucket size of 16 with a minimum difference of 25% would mean that every consecutive input has a difference of at least 4.
- max difference - the opposite of the min difference!
Notice:
- From a bucket size of 8 and up (highly likely with most hashmaps in my experience!) the min, average, and max differences are all stable. I’m especially happy with the min difference - as this means that we can rely on the fibonacci hash putting a 38% delta difference between consecutive inputs.
- We average about a 47% difference between any two consecutive inputs.
- The avalanche percentage is quite lumpy - this is because any
log2(bucket_size)
that is odd will result in us under or over rounding when we check if at least half the bits flipped. I’ve leaned on under rounding any ties. - Overall we’re seeing about a 50% probability for avalanching to have happened between consecutive inputs. But remember - this fibonacci hash isn’t our actual hash! It’s what we apply after we’ve already hashed the key. So this will compound with the hash I actually use on the input keys, meaning this just helps us avalanche better than before. Nice!
I’m pretty happy overall with this - the hashing performance has no effective regressions in my benchmarks, but the hashing is much better as a result. I’ve landed this code in GitHub already so anyone who uses my hashmap.h library will already benefit from this.