RoBlog – Refactoring Reality

Discrete thought packets on .Net, software development and the universe; transmitted by Rob Levine

A cache key with a bad GetHashCode() is another name for a memory leak

by Rob Levine on 7-Mar-2008

[With apologies to Raymond Chen for the title]

.GetHashCode() – the method exists on every single object you create; and yet I’ve come to the conclusion that a lot of developers neither know much about it, nor care about it. It is almost as though it is considered of no consequence.

Hash codes are not really very complicated in principle, but far from being of no consequence, they are critical whenever you are designing a class that may be used as a key in a collection (like a hashtable or generic dictionary). Failure to implement this method correctly can cause havoc.

The tale of the bad hash code

Some time ago, at a previous gig, we were developing a web application and we decided to that we would provide a crude cache for small amounts of data that were used repeatedly but never change (by never I mean that it changes so rarely that a recycle of the web app would be permissible; e.g. a list of U.S. counties). The aim here was to minimise the number of database calls for these small, frequently used and immutable sets of data. For reasons, lost in the mists of time, this cache was provided via a static Hashtable (rather than, for instance, the System.Web.Caching.Cache).

We were building on some building blocks “donated” to us by another project, and one of these included a class called CacheKey.

The idea was simple enough; each collection in the cache would have a name, (e.g. “USCounties”), and zero or more optional parameter fields (e.g. if you were caching U.S. counties in Texas only you might have name, “USCounties”, and a single field with the value “TX”).

The class looked something like this:

public class CacheKey
   public string Name;
   public object[] Fields;

Whoever wrote this class had overridden the .Equals() method in order to implement a notion of key equivalence; two keys are equal if their Names are equal, and their collection of Fields also have equal contents (ignoring null checking on Fields for brevity):

public override bool Equals(object obj)
    CacheKey supplied = obj as CacheKey;  

    if (supplied == null)
        return false;  

    if (this.Name != supplied.Name)
        return false;  

    if (this.Fields.Length != supplied.Fields.Length)
        return false;  

    for (int i = 0; i < this.Fields.Length; i++)
        if (this.Fields[i] != supplied.Fields[i])
            return false;
    return true;

However, they had overridden .GetHashCode() with the following implementation:

public override int GetHashCode()
    return base.GetHashCode();

It may be that no-one intentionally coded this method, but that they responded to the compiler warning ” ‘MyProj.CacheKey’ overrides Object.Equals(object o) but does not override Object.GetHashCode() ” by letting the IDE auto-generate the method (which gives exactly this implementation).

The problem is that, whether auto-generated or hand crafted, this is probably the worst implementation of GetHashCode possible. (Not that I am blaming the IDE; I’m sure the motivation for this auto-complete is to allow your code to compile, not to provide a meaningful implementation).


Why is this implementation so bad?

The golden rule of a hash code is that your GetHashCode implementation must match your Equals implementation. What does this mean? It means exactly the same fields that contribute to the equality check must also contribute to the hash code.

This is covered by the first two in the list of three requirements stated in the MSDN object.GetHashCode documentation:

A hash function must have the following properties:

  • If two objects compare as equal, the GetHashCode method for each object must return the same value. However, if two objects do not compare as equal, the GetHashCode methods for the two object do not have to return different values.

  • The GetHashCode method for an object must consistently return the same hash code as long as there is no modification to the object state that determines the return value of the object’s Equals method. Note that this is true only for the current execution of an application, and that a different hash code can be returned if the application is run again.

  • For the best performance, a hash function must generate a random distribution for all input.

There is no need to regurgitate the many descriptions of how hashtables actually work, but the very lightweight overview is this:

  1. Keys in the hash table are stored in individual buckets to reduce the overall search space when looking for an item in the hash table.
  2. The result of .GetHashCode() determines which bucket a key is stored in (i.e. when you Add an item to the hashtable, it calls .GetHashCode() on the key and uses the resulting value to determine which bucket to place the key in).
  3. When a “contains” check [e.g. hashtable.Contains(newKey)] is performed, the key being checked (newKey) has its .GetHashCode() method called. Based on this result, the “contains” method determines which bucket any possible matches may exist in, and performs an equality check on every item in this bucket until a match is found.

Point three means it is critical that two “equal” keys return the same hash code, otherwise the whole basis on which a hash table works comes crashing down. If we get a different value every time we call .GetHashCode() on different instances of equal keys, then our hashtable will forever be looking in the wrong buckets to see if we have a match; inevitably it will keep finding that we don’t have a match.

To put it another way, imagine trying to find your name in the phone book if you couldn’t rely on your brain’s ability to determine the first letter of your surname.

If every time I called brain.GetFirstLetterOfSurname() I got a random letter (instead of “L” for Levine), then most of the time I’d be trying to find the name “Levine” in the section for another letter.


What is the effect of this broken implementation?

So, back to the original issue; what is the effect of calling base.GetHashCode()? Our class inherits directly from System.Object, so we are calling System.Object‘s implementation of .GetHashCode(). I don’t know the detail of it’s implementation, but calling it 5000 times, each for a new instance of a CacheKey with a name of “USCounties” produced this:

Hashcode graph for broken implementation

which is bad; a different value for each call. What we wanted to see was the same hash code being returned for every instance (since they are all equal). In other words we wanted this (a straight line):

Hashcode graph for working implementation

What is the net effect of this in our cache? When I first do a call to the cache to get my list of counties, the cache informs me the data isn’t already cached (i.e. a cache miss) and so a call to the database is made, the data is retrieved, and this new collection is added to the cache. This is correct behaviour.

But on the second call to get my list of counties, the cache still informs me the data isn’t already cached and so we go to the database again, retrieve the data and add it to the cache. And again on the third call, and so on.

This gives us two issues. Firstly my cache is broken; I am making a database call every time I request the data, which defeats the purpose of the cache. But much much worse than that – every time I am actually caching the data and then losing it in the cache. So after 5000 requests for the data, I have it cached 5000 times*:

I have one great big memory leak.

The moral of the story here is don’t implement your own cache-key type classes unless you actually implement .Equals() and .GetHashCode() together in a compatible way. In fact, ninety-nine times out of a hundred you probably won’t need to do this; using a string or integer as a cache-key has no such issues – it only becomes an issue when you implement your own.

Implementing a simple, but reasonable hashing algorithm is fairly simple, and I’ll look at this in my next post.

* Due to the statistical likelihood of the occasional “random” hash code collision I may have the odd cache hit, so maybe I am only wasting 4998 instances instead of 4999!

Making the .Net XmlTextReader accept colons in element names.

by Rob Levine on 6-Mar-2008

This started as an addendum, to What is a valid XML element name?, but then I discovered something that made it worth breaking out into a separate post!

Ayende added a comment to his blog (under my comment) to say that he tried the ‘bad’ xml in question on three parsers and none of them could handle it. Naturally I thought I’d have a quick go too.

First I tried with the .Net System.Xml.XmlDocument class and the System.Xml.XmlTextReader class and neither of these would handle the “double-colon” element names. Next I tried two commercial XML editors, XmlSpy and StylusStudio, both of which were happy to let it pass their well-formed check without complaint (and they do both start complaining if you add other non-allowed characters). I don’t know what parsers either of these products are built on, but on the surface they seemed to be more compliant than .Net

Or so it appeared. One thing I noticed was that both System.Xml.XmlDocument class and System.Xml.XmlTextReader classes barf with the same exception, being raised from within System.Xml.XmlTextReaderImpl.ParseElement().

A quick look at this method using Lutz Roeder’s excellent Reflector revealed something new and interesting. This class (System.Xml.XmlTextReaderImpl) has an internal boolean property, Namespaces, which changes the behaviour of this element parsing method to allow or disallow multiple colons. This makes sense when you think about it; if you don’t support namespaces then there is no issue with multiple colons. If you do support namespaces then the colon is reserved to separate the namespace prefix from the element’s local name. It is this very point that the XML RFC refers to regarding colons, and which I quoted in the previous article.

A closer look still revealed that a Namespaces property is exposed on the System.Xml.XmlTextReader class. And guess what? Setting this property to false allows the reader to start accepting “multi-colon” element names! Well – that is certainly a new one to me.

However, I couldn’t find an equivalent way of changing the System.Xml.XmlDocument‘s behaviour to accept this type of xml. To be honest, I’m not too bothered, because I can’t imagine using this particular style of xml any time soon!

What is a valid XML element name?

by Rob Levine on 5-Mar-2008

Ayende Rahien has noted some peculiar looking XML that is output by Subversion.

Specifically, he takes issue with a start tag of


Well – I can honestly say that I’ve never seen XML like this before (and I’ve been using XML since the late Cretaceous Period) but I’m not so sure that it is actually wrong! Although I too balk at the sight of it, the XML RFC does seem to permit this as valid and parsable XML.
Taking a look at section on start tags would appear to allow for this:

A start tag is defined by

STag ::= ‘<‘ Name (S Attribute)* S? ‘>’

where Name is defined as

Name ::= (Letter | ‘_’ | ‘:’) (NameChar)*

and NameChar is defined as

NameChar ::= Letter | Digit | ‘.’ | ‘-‘ | ‘_’ | ‘:’ | CombiningChar | Extender

What this says to me is that this is an allowable start tag. In fact, from a syntactical standpoint, there seems to be no limit to the use of colons in element names.

However, the RFC does also have this to say about the use of colons:

“The Namespaces in XML Recommendation [XML Names] assigns a meaning to names containing colon characters. Therefore, authors should not use the colon in XML names except for namespace purposes, but XML processors must accept the colon as a name character.”

In other words – other parts of related XML specifications reserve the colon, but from a purely XML markup standpoint this would appear to be valid.

It may be a WTF, and it certainly isn’t nice but I think it is actually valid and is not “wrong, period” (as claimed), just “wrong, it makes me feel weird“!

What is *more* of a leap year? 2007 or 2008?

by Rob Levine on 10-Jan-2008

I am being a little facetious with the title here, but it is a point I try to bear in mind (if only because it stops me falling into the year length trap repeatedly).
If you are thinking in terms of a calendar year then 2008 is a leap year and 2007 is not. Full stop.

But if you are working with periods of a year (be it insurance policies, people’s ages, shelf lives of products or whatever) then actually 2007 is more of a leap year.

A year from 1st Jan 2008 has 366 days. A year from 1st Feb 2008 has 366 days. But by the time we get to 1st March 2008 (i.e. we have passed 29th Feb 2008), a year in the future is only 365 days away.
In other words you only have 31+28 = 59 days in the year 2008 for which a year forward is 366 days (60 days if you consider one year past 29th Feb 2008 to be 1st Mar 2009)

But for 2007, once you have passed 28th Feb 2007, then a year forward is 366 days, not 365. In other words for over 4/5ths of 2007 a year in time is 366 days.

So I’d rather call 2007 the leap year thank you very much 😉