Archive for the ‘Rant’ Category

Yesterday I needed to reinstall a Windows font that seemed to be misbehvaing. Over the years I have built up a fair bit of knowledge about Windows, so I thought I knew what to do:

  1. Delete the font from \Windows\fonts
  2. Locate the i386 directory on the install media for Windows (or my local drive if I ever copied it over)
  3. Find the compressed font in question; e.g. for Wingdings, the compressed file would be WINGDINGS.TT_
  4. Use the command line expand.exe tool to expand this to WINGDINGS.TTF
  5. Install the font by copying it to the \Windows\fonts directory.

If I remember correctly, the process has been the same for a long time. Back in the Windows 9x days there was no i386 (this was of Windows NT provenance), so you’d have to look in the cabinet (.cab) files. Before that, in the Win3.x/DOS days, you’d have to search the install media for any files you wished to reinstall. But overall, the process has been with us for some time, and the repository of files that is the i386 folder was the place to start.

The problem was that my Vista install did not contain a i386 directory. As a perused the directory structure of the DVD I discovered a directory named sources which contained a fair few files, but not nearly enough for a Windows install, and it didn’t contain the file I was looking for. In fact, the entire DVD didn’t seem to contain the file I was looking for. In fact, now I’d come to ask the question: “WHERE ARE ALL THE BL**DY INSTALL FILES?”.

And then I noticed install.wim, all 2,412,507,182 bytes of it (that is 2.24GB folks).

It is bound to be in there. It must be like the .cab file of yesteryear.

So I double-clicked it; no default viewer for this file type.

WinZip – “Not a valid archive”.

UltraISO (great utility) – “Invalid or unknown image file format”.

And then I thought “hang on – there must be a userspace tool that ships with Windows for this file format, even if it is command line”; after all, who’d ship an O.S. in a format that you can’t access unless you were the installer program? Clearly that would be silly.

So I searched my Windows installation, and I searched the install disc, and then I searched the web for info.

No.

Either I missed something obvious, or there is no quick way to get at the files contained in a .wim image file.

From all my searching, apart from the odd shareware utility that might have been able to do it, I found one official answer; the IMAGEX.EXE utility. A Microsoft tool that lets you mount a .wim file as a drive.

And where was this tool?

Available as a free download in the Windows Automated Install Kit.

And how big is this download?

Glad you asked.

It is 992.2MB.

3 observations to make here:

  • If I am right and there really isn’t a userspace tool to read .wim files that comes with Windows then that is absolutely pathetic.
  • I’m glad I got my broadband speed doubled last week.
  • Naturally the font I resinstalled didn’t fix the problem, so I still can’t print Franklin Gothic Book type in bold to my HP Photosmart 3310 printer.

In my last blog article I outlined a gotcha whereby a developer overrides .Equals() without providing a similarly meaningful override to .GetHashCode(). I gave a description, and illustration, of why it is so important to ensure that the same fields are considered for the two methods.

In this article I discuss a simple strategy for providing a .GetHashCode() implementation and compare it to some flawed equivalents. As with the previous blog article, my primary reason for writing about this is that I’ve seen some rather poor implementations of this, but it isn’t really complicated or difficult to come up with a reasonable solution.

Referring back to the same three requirements discussed previously in the MSDN object.GetHashCode documentation, the point that is most pertinent to this article is point three (since we’ve already discussed points one and two):

A hash function must have the following properties:

  • If two objects compare as equal, the GetHashCode method for each object must return the same value. However, if two objects do not compare as equal, the GetHashCode methods for the two object do not have to return different values.

  • The GetHashCode method for an object must consistently return the same hash code as long as there is no modification to the object state that determines the return value of the object’s Equals method. Note that this is true only for the current execution of an application, and that a different hash code can be returned if the application is run again.

  • For the best performance, a hash function must generate a random distribution for all input.

Points one and two above are absolute rules regarding the behaviour of .GetHashCode(). If you break either of these, then consumers of your hashcode (e.g. a hashtable) will not function correctly. As discussed in the previous article, a broken implementation may cause a hashtable to forever lose instances of your key placed within it and then report that it does not contain the key when it does. However, long as points one and two are obeyed, then your hash code (and hence hashtables) will work.

Point three is a recommendation. If you fail to adhere to this point, a consumer of your hash code will still function, it just won’t function very well.

 

A functional (but very bad) hashing algorithm

Consider the following implementation:

public override int GetHashCode()
{
    return 1;
}

This implementation does behave correctly with regard to the first two points above. If two objects are equal, then the method does return the same value; it just so happens that it also returns this value if they are not equal. That is fine and allowable – we are implementing a hash code, not a unique identifier. Point one above is satisfied. Additionally, our object will consistently return the same hash code (whether or not state has been modified). Point two above is satisfied.

It fails on point three though. Far from generating a random distribution for all input, it always returns the same number. Using this implementation will mean that all instances in a hashtable end up in the same bucket. This, consequently, means that the hashtable will have to do a .Equals() check on the full set of data to find a match. In other words, we have just reduced our hashtable to nothing more than a simple list (e.g. and ArrayList). No better (in fact probably marginally worse) than doing a foreach loop over the data set and manually looking for a match!

 

Other approaches to hashing

In order to examine possible hashing algorithms more closely, I decided to take a real world set of data, and use this as a basis of further investigation. The data set that sprang to mind was my music library, probably because I was staring at Winamp while trying to think of a suitable data source!

My music collection consists of a number of music tracks; for each of these there are several pieces of metadata. I exported my music library as iTunes xml and created the following interface to represent a single music track:

public interface IMusicTrackFile
{
    string TrackName { get; set; }
    string AlbumName { get; set; }
    string Artist { get; set; }
    string Format { get; set; }
    int? Year { get; set; }
    int? BitRate { get; set; }
    int? Size { get; set; }
    int? Time { get; set; }
    int? PlayCount { get; set; }
}

The aim of this interface is to represent track information that can then be stored as the key within a hashtable. Note that I am writing the word “hashtable” in lower case, because I am thinking of the general concept of a hashtable collection; in the .Net framework this includes System.Collections.Hashtable, as well as the generic System.Collections.Generic.Dictionary<> collection, and others.

Given that I will consider two tracks to be equal only if all the properties on the music track instance are the same, then ideally I want my hash code to be based on all these properties. The obvious approach here is to use boolean maths to combine the hash codes from the constituent properties to give one overall hash code.

Three immediate choices spring to mind to combine my constituent hash codes into one overall hash code: AND, OR or XOR. It doesn’t take a rocket scientist to realise that the only choice of these three worth considering seriously is XOR.

Why? Because ANDing many integers together will converge on all bits being zero; the number zero itself:

Hashcode graph for AND.

Every single track here gave a hash-code of zero – that is a 100% hash code collision rate. No better, in this particular case, than our “return 1″ implementation above.

ORing many integers together will converge on all bits being set; the number -1 in the world of two’s-compliment integer representation:

Hashcode graph for OR.

[Note: The Y-Axis range is 0 to -25 million here]

There is some distribution of values here, due to “flutter” in two or three bits; overall only 311 (out of 6142) unique values, and all in the negative number range. This is better than the AND scenario, but still hardly a good spread of values.

Only XOR gives anything like a decent spread of values:

Hashcode graph for XOR.

[Note: The Y-Axis range here is +25 billion to -25 billion; over 2000 times larger than the OR case]

This is much more like it. 6133 unique values (out of 6142). Note that these numbers are also spread throughout the entire range of a signed integer.

 

Summary

In the example above, XORing the hash-codes of all constituent fields that form part of an equality check produced a functional and well balanced hash-code. This may not always be the case, and I am not advocating using an “XOR everything blindly” approach, but it often does produce a very reasonable hash. It isn’t particularly difficult to output and plot hash code distributions for a sample range of your data, and see whether you have a reasonable algorithm.

Bear in mind, as well, that even a hashing algorithm that is halfway good is probably good enough to start off with. If your hashtables turn out not to be performing well enough, revisit the algorithm (rather than spend too much time indulging in the root of all evil)!

In case anyone is interested, my sample music library can be found here, and the Visual Studio 2005 solution I used during this analysis is here.