RoBlog – Refactoring Reality

Discrete thought packets on .Net, software development and the universe; transmitted by Rob Levine

Implementing a simple hashing algorithm (pt II)

by Rob Levine on 3-Apr-2008

In my last blog article I looked at implementing a hashing algorithm by trying different bitwise operations on the constituent fields of our class. It was very clear that, out of AND, OR and XOR, only XOR provided us with anything like a balanced hash code. However, although it worked well in the previous example (a music library), all is not quite as it seems. On closer inspection of the behaviour of the XOR hash, it turns out that this algorithm has its own flaws and is far from ideal in many situations.

 

The Commutativity Issue

The most obvious problem with the "XOR all fields" approach to hash codes is that any two fields XOR’d together will give you the same value, regardless of the order in which they are XOR’d (i.e. XOR is a commutative operation). This increases the chance of hash code collisions, which of course is a bad thing. Consider the following example (all fields contributing to the hash):

Forename    Middle name    Surname
John        Paul           Jones
Paul        John           Jones

Clearly the two people above are not the same person, but they will have the same hashcode if we generate our hash with a simple XOR of the hash codes of the constituent fields.

As mentioned in previous posts, a hashcode is not a unique identifier, and the fact both people have the same hash-code won’t break a hashtable. However, it will lead to a less efficient hashtable, as both our people will end up in the same bucket of the hashtable when, ideally, they shouldn’t.
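To make the collision concrete, here is a minimal sketch of what I mean (the Person class and its field names are hypothetical, purely for illustration):

public class Person
{
    public string Forename;
    public string MiddleName;
    public string Surname;

    // Naive XOR-of-fields hash: XOR is commutative, so swapping Forename
    // and MiddleName produces exactly the same value.
    public override int GetHashCode()
    {
        return Forename.GetHashCode() ^ MiddleName.GetHashCode() ^ Surname.GetHashCode();
    }
}

// Both people below end up with identical hash codes:
// new Person { Forename = "John", MiddleName = "Paul", Surname = "Jones" }
// new Person { Forename = "Paul", MiddleName = "John", Surname = "Jones" }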

 

The Range Issue

Another problem that the XOR approach doesn’t address is that of the potential range of values of a hashcode. In my previous post I showed that the XOR implementation of hash code seemed acceptable for my music library example. However, in reality I was relying on the properties of the individual hash codes that made up my overall hash code. Specifically, many of the fields that contributed to the hash were strings, and the BCL implementation of System.String.GetHashCode() has a pretty good distribution. Had my music track entity not contained several string fields, things would have looked very different.

But what about fields that may have a poor "innate" distribution? Given that System.Int32.GetHashCode() returns the integer itself, what happens if I have a field representing a person’s age? The spread of values is, at best, 0 < age < 120, which is hardly the distribution across integer-space that you might want. A person’s height in cm? 0 < height < 250.

You see the problem? I don’t really want to be combining all these hash codes in the very low integer ranges using the XOR approach because it means my final "cumulative" hash code will be stuck within this range as well.
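As a tiny illustration (the field values here are arbitrary): because an int’s hash code is the int itself, XORing a few small-valued fields can only ever yield another small value:

int age = 42;          // age.GetHashCode() == 42
int heightCm = 180;    // heightCm.GetHashCode() == 180

// XORing two values that both fit in 8 bits can only produce another
// value in the range 0-255 - nowhere near the full int range.
int hash = age.GetHashCode() ^ heightCm.GetHashCode();   // 42 ^ 180 == 158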

 

Examining the flaws in more detail

I don’t pretend to be an expert on hashing algorithms, but I can see that we have issues here and so the best way of discovering more (for me, at least) is to work through a broken example and see what I can do to fix it.

I very quickly settled on the idea of trying to write a replacement String.GetHashCode() implementation. I would get a dictionary of words (all in lower case), and try to write a hash code implementation based on the ASCII code of each letter in the word. Given that the hashcode of each ASCII code would be the ASCII code itself (since the hashcode of an integer is the integer itself), we would have a poor distribution of individual character hash codes; all in the range 97 (a) – 122 (z). This approach would also highlight problems such as commutativity (since many words have the same letters, just in a different order).

My XOR implementation for this approach would look like this:

public int GetHashCode(string word)
{
    // XOR together the hash codes of each character in the word.
    // Since the hash code of an int is the int itself, this is just
    // the XOR of the ASCII codes (97-122 for lower-case letters).
    char[] chars = word.ToCharArray();
    int hash = 0;
    foreach (char c in chars)
    {
        int ascii = (int)c;
        hash ^= ascii.GetHashCode();
    }
    return hash;
}
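A quick way to see the commutativity problem with this implementation: any two anagrams are guaranteed to collide, because XOR ignores the order of the characters (the example words are my own choice):

// Using the method above: "listen" and "silent" contain the same letters,
// so the XOR of their ASCII codes is identical - a guaranteed collision.
int h1 = GetHashCode("listen");
int h2 = GetHashCode("silent");
// h1 == h2 is always true here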

I sourced my dictionary from here, and de-duplicated it (and converted the words to lower case) to produce this list of words.

As expected, this algorithm (referred to as AsciiChar_XOR in the diagram) has a spectacularly bad value distribution, as shown in this histogram:

[Histogram of hash code distribution for AsciiChar_XOR]

[Note that the width of each bucket here is 67108864, being 2^32 / 64]

Surprise, surprise – all 2898 words fall into the same histogram bucket! In fact they all fall into the far narrower range of 0-127 – the ASCII code range.

All of a sudden the simple XOR approach to hashing looks like a very poor performer indeed.

 

Examining better algorithms

Since we’ve already discussed the weaknesses of XOR, we should have a fairly good idea of where to focus our attention to create better algorithms. Firstly we should be choosing an algorithm that is non-commutative, and secondly we should be choosing an algorithm that uses the full range of integer-space, rather than limiting a hashcode to the range of its constituent members.

A bit of a search around reveals two approaches that are often discussed. The first one, given to me as a boiler-plate example by a Java developer friend, looks like this:

public override int GetHashCode()
{
    int hash = 23;
    hash = hash * 37 + (field1 == null ? 0 : field1.GetHashCode());
    hash = hash * 37 + (field2 == null ? 0 : field2.GetHashCode());
    hash = hash * 37 + (field3 == null ? 0 : field3.GetHashCode());
    return hash;
} 

[I shall refer to this type of algorithm as JavaStyleAddition as it seems to be a very common implementation in the Java world]

The second common pattern looks something like this:

public override int GetHashCode()
{
    int hash = 23;
    hash = (hash << 5) + (field1 == null ? 0 : field1.GetHashCode());
    hash = (hash << 5) + (field2 == null ? 0 : field2.GetHashCode());
    hash = (hash << 5) + (field3 == null ? 0 : field3.GetHashCode());
    return hash;
} 

[I shall refer to this type of algorithm as ShiftAdd]

Both of these have some similar characteristics. By taking the cumulative hash so far, applying an operation (multiply for JavaStyleAddition and left-shift-5 for ShiftAdd) and then adding the new field’s hash code, they both avoid the commutativity issue since

(x * n) + y is not, in general, equal to (y * n) + x.

They also both increase the range of the cumulative hash code above that of the constituent hash codes due to the effect of multiplying the cumulative hash code each time (remember that a single left shift is a multiplication by two).

You will also notice that both approaches use a prime number (23 in the examples shown) as the starting value for the hash, and JavaStyleAddition uses another prime (37) as the multiplier. My guess is that this, statistically, makes collisions less likely as you multiply up your hash code because if one side of the multiplication has no factors (other than 1 and itself), then you are lowering the statistical average number of factors of the result. Of course, I may be wrong about that :-s

A variant of ShiftAdd that I have seen during my Google journeys is one in which the hash codes are shifted and then XOR’d (rather than added). I shall refer to this as ShiftXOR.
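The code for this variant wasn’t shown in the places I found it, but as I understand it ShiftXOR looks something like this:

public override int GetHashCode()
{
    int hash = 23;
    hash = (hash << 5) ^ (field1 == null ? 0 : field1.GetHashCode());
    hash = (hash << 5) ^ (field2 == null ? 0 : field2.GetHashCode());
    hash = (hash << 5) ^ (field3 == null ? 0 : field3.GetHashCode());
    return hash;
}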

[Histogram of hash code distribution for ShiftXOR, JavaStyleAddition and ShiftAdd]

This certainly looks better than the histogram for AsciiChar_XOR 😀

However, all three algorithms still cluster around the centre of the number range, and all exhibit other major spikes in distribution.

What I could have done at this point is break into a major mathematical and statistical analysis of these three hashing algorithms, but I decided against it for two key reasons. Firstly – I would have got bored, and secondly – I wouldn’t have had the first idea where to start!

Nope – I felt it better to fall back on my hacker instincts and munge various forms of the above algorithms to see if I could produce a better algorithm for my particular use case.

I quickly came up with two further algorithms, both combining the left-shift approach with the prime-add approach of the above algorithms. The only difference between them is that one adds the hashcodes each time, while the other XORs them:

 

public override int GetHashCode()
{
    int hash = 23;
    hash = ((hash << 5) * 37 ) + (field1 == null ? 0 : field1.GetHashCode());
    hash = ((hash << 5) * 37 ) + (field2 == null ? 0 : field2.GetHashCode());
    hash = ((hash << 5) * 37 ) + (field3 == null ? 0 : field3.GetHashCode());
    return hash;
} 

[ShiftPrimeAdd]

and

 

public override int GetHashCode()
{
    int hash = 23;
    hash = ((hash << 5) * 37) ^ (field1 == null ? 0 : field1.GetHashCode());
    hash = ((hash << 5) * 37) ^ (field2 == null ? 0 : field2.GetHashCode());
    hash = ((hash << 5) * 37) ^ (field3 == null ? 0 : field3.GetHashCode());
    return hash;
} 

[ShiftPrimeXOR]

For good measure, I threw these into the mix alongside the standard BCL implementation of System.String.GetHashCode() [labelled as StringGetHashCode in the diagram] to see how they would fare:

[Histogram of hash code distribution for ShiftPrimeXOR, StringGetHashCode and ShiftPrimeAdd]

[Note that the y-axis for this histogram is half that of the previous diagram]

Now – that is MUCH more like it. We have a well balanced hash code distribution across the entire range of integer space. There are a few minor spikes, but all three seem to compare favourably with each other. The approach of including the prime and ‘multiplying’ up each time really does seem to do the trick.

Which would I choose out of ShiftPrimeXOR and ShiftPrimeAdd? Not sure – I’d have to benchmark them first and see which was fastest!

 

Conclusion

In summary, just XORing fields together may well produce an awful hash code distribution, unless the constituent field hash codes are themselves well balanced.

However, during the course of writing this article, I have realised that there are other relatively simple implementations that provide a good hash code distribution (for this example at least). More than that, I have reinforced my belief that these things are best checked out if you have any doubts. It doesn’t take long to put together a test harness and profile your algorithms with a sample of your data.
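The harness needn’t be anything fancy either. A minimal sketch of the sort of thing I mean (the method and parameter names here are just for illustration, the usual System and System.Collections.Generic usings are assumed, and the bucket width matches the 64 buckets used in the histograms above):

// Count how many words land in each of 64 equal-width buckets spanning
// the full int range (bucket width 67108864 = 2^32 / 64).
static int[] BuildHistogram(IEnumerable<string> words, Func<string, int> hashFunction)
{
    const long bucketWidth = 67108864L;
    int[] buckets = new int[64];
    foreach (string word in words)
    {
        long offset = (long)hashFunction(word) - int.MinValue;   // 0 .. 2^32 - 1
        buckets[(int)(offset / bucketWidth)]++;
    }
    return buckets;
}

// e.g. int[] histogram = BuildHistogram(wordList, w => GetHashCode(w));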

One thing I have omitted from my investigation is any discussion of the speed of the algorithms. It would be worth benchmarking each one because if a class is being used in a hashtable, its .GetHashCode() method is called for each .Add, .Remove and .Contains call. However, the following thought does occur to me: with these sorts of repetitive mathematical operations, the relative speed of XOR vs. shift vs. multiply (etc.) may well depend on your CPU architecture.
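If I did benchmark them, it would probably be nothing more sophisticated than a Stopwatch run over the word list for each candidate algorithm (again, the names here are purely illustrative):

// Rough-and-ready timing: hash every word in the list 'iterations' times
// and return the elapsed milliseconds. Relative results may vary by CPU.
static long TimeHashFunction(IList<string> words, Func<string, int> hashFunction, int iterations)
{
    var stopwatch = System.Diagnostics.Stopwatch.StartNew();
    for (int i = 0; i < iterations; i++)
    {
        foreach (string word in words)
        {
            hashFunction(word);
        }
    }
    stopwatch.Stop();
    return stopwatch.ElapsedMilliseconds;
}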

On reflection, there is a lot more to hashing than the small amount I know and I’m sure many mathematical research papers have been written on the subject.

In the future, my default choice will probably lean towards ShiftPrimeXOR or ShiftPrimeAdd as a starting point. It would be a waste of time to spend days up front trying to work out the perfect hashing algorithm. My approach would be to choose one, use it, and keep an eye on its performance. If it proves too problematic then consider optimising it; otherwise leave it alone.

Right – enough already about hashing algorithms!

A Visual Studio 2008 project containing the console application I used to generate these results can be found here.

2 thoughts on “Implementing a simple hashing algorithm (pt II)”

  1. Oliver Hallam says:

    I’m not convinced that this is necessarily the best way to create a hash function. Firstly note that x << 5 is the same as x * 32, and so your shift and multiply is the same as x * (32 * 37).

    Secondly note that each time you combine your hash codes you are pushing the hash code back 5 bits and the bottom 5 bits are determined only by the current character. Thus after combining 7 characters, the first character no longer has any effect on the hash code, and so the shift-add or shift-xor hash functions only actually consider the last 5 characters in the string!

    In fact your combination of multiplying and shifting also has this same issue – you have effectively shown that with your data set you can get a good enough hash code by only considering the last 7 characters (which is an interesting observation in itself).

    My personal preference for creating hash codes (when dealing with a simple class with a fixed number of members) is to add a (unique) hard-coded random integer to the hash code of each member (this is known as salting) and xor the results of these. Assuming all the members have well distributed hash functions then this should too; although I have not performed a thorough analysis.

  2. admin says:

    Hi Oliver,

    Thanks for your comments.

    With regard to your first point (x<<5 == x*32) - yes, agreed; the reason I included the left shift variant was because it appeared many times on various forums and articles and I wanted to see how well it measured up. My gut feeling (without a full mathematical basis, I admit) is that it is still always going to be better to multiply by a prime (as in the "JavaStyleAddition").

    Your second point (shifting left leads to the impact of each byte being removed from the hashcode after a certain number of characters); yes - I see what you mean. In fact your comments now make me want to revisit the last two hash code algorithms and lose the left shift. It might be worth (IMHO) multiplying by a larger prime to increase the spread in values that I was achieving with the shift, but overall a straight shift is obviously bad as it leaves zeros in its wake. I wonder how a rotate rather than a shift may fare here. If I get time I'll write a follow up and explore this.

    I was interested by your last point about salting. I have come across salting before when creating cryptographic hashes, for the purposes of increasing the strength of the hash and preventing someone "reverse engineering" a password from its hash. By way of illustration I mean the situation in which I calculate the hash of passwords and then store these (rather than the passwords) in a database. If someone subsequently was able to read the hashes from my database, they could reverse engineer many passwords by generating hashes for every word in a dictionary and then seeing if they got any matches in the hash table. Salting the hash, perhaps with the userid, prevents this kind of attack.

    However, I'm not 100% clear how that would work in this case. You say "add a (unique) hard-coded random integer to the hash code of each member (this is known as salting) and xor the results of these"; do you mean a *single* hard-coded number to use with all hash codes, like this:

    public override int GetHashCode()
    {
        const int SALT = 278466743; // a random number
        int hash = SALT + (field1 == null ? 0 : field1.GetHashCode());
        hash ^= SALT + (field2 == null ? 0 : field2.GetHashCode());
        hash ^= SALT + (field3 == null ? 0 : field3.GetHashCode());
        return hash;
    }

    I can see that this wouldn't suffer from the problem of "shifting away the impact of each character", but my gut feeling is that this would lead to a poor distribution of hash codes across integer space; in the example of using ASCII values, the above wouldn't produce any negative numbers. Even if I changed the salt to be negative, it would not address the range issue.

    In any case, this is clearly food for thought, and there is obviously scope for me to try more experimentation on this. Thanks again for your comments; they've made interesting reading.

    Rob