<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>RoBlog - Refactoring Reality (in C#)</title>
	<atom:link href="http://blog.roblevine.co.uk/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://blog.roblevine.co.uk</link>
	<description>Discrete thought packets on .Net, software development and the universe; transmitted by Rob Levine</description>
	<lastBuildDate>Wed, 30 Jun 2010 21:58:52 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>ConfigGen &#8211; .NET Configuration File Generator</title>
		<link>http://blog.roblevine.co.uk/?p=64</link>
		<comments>http://blog.roblevine.co.uk/?p=64#comments</comments>
		<pubDate>Wed, 30 Jun 2010 15:10:04 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[.Net]]></category>
		<category><![CDATA[ConfigGen]]></category>

		<guid isPermaLink="false">http://blog.roblevine.co.uk/?p=64</guid>
		<description><![CDATA[After a week and a half off work and on holiday, which consisted of me working solidly on my new pet project, I am pleased to announce the first (alpha) release of ConfigGen &#8211; .NET Configuration File Generator.
ConfigGen is a .Net configuration file generation tool, and it is intended to make the .Net developer&#8217;s life [...]]]></description>
			<content:encoded><![CDATA[<p>After a week and a half off work and on holiday, which consisted of me working solidly on my new pet project, I am pleased to announce the first (alpha) release of <b>ConfigGen &#8211; .NET Configuration File Generator</b>.</p>
<p>ConfigGen is a .Net configuration file generation tool, and it is intended to make the .Net developer&#8217;s life easier. Much easier, if my experience of how some dev teams go about managing their configuration files is anything to go by! I&#8217;ve written various forms of this tool at several different places, all to tackle the same basic problem of having to manage loads of different configuration files for different machines and environments. This version on CodePlex is completely new, started from scratch, with no code borrowed or copied. Hopefully it is all the better for it!</p>
<p>It is a pretty simple tool; no rocket science was necessary. Basically, you maintain an xml template and a settings spreadsheet. The template looks pretty much like your existing configuration file, except that values that are different between machines/environments are tokenised. The settings spreadsheet has a list of all the tokens across the top, and all the machines/environments requiring their own config down the left hand side. Predictably enough, ConfigGen generates one configuration file for each row in the spreadsheet, replacing the tokens with their spreadsheet values as it goes. Additionally, it allows the structure of the configuration file to be changed, depending on the value of these settings in the spreadsheet.</p>
<p>This release is an early alpha version, but have a look and let me know what you think. So please, head over to the project&#8217;s CodePlex page and hopefully save yourself the frustration of discovering your config files are out of sync on your next push to your production environment!</p>
<p>Check it out at:</p>
<p><a href="http://configgen.codeplex.com/">http://configgen.codeplex.com/</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.roblevine.co.uk/?feed=rss2&amp;p=64</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Refix &#8211; .NET dependency management tool</title>
		<link>http://blog.roblevine.co.uk/?p=56</link>
		<comments>http://blog.roblevine.co.uk/?p=56#comments</comments>
		<pubDate>Mon, 14 Jun 2010 12:15:32 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[.Net]]></category>

		<guid isPermaLink="false">http://blog.roblevine.co.uk/?p=56</guid>
		<description><![CDATA[Just a short blog entry to highlight an exceptionally cool project that a colleague of mine has been working on in his spare time: Refix. A .NET dependency management tool. 
This tool attacks the problem of having a project with dependencies, where these dependencies themselves may have common, but different version, assemblies.
I&#8217;ve certainly had this [...]]]></description>
			<content:encoded><![CDATA[<p>Just a short blog entry to highlight an exceptionally cool project that a colleague of mine has been working on in his spare time: <b>Refix. A .NET dependency management tool</b>. </p>
<p>This tool attacks the problem of having a project with dependencies, where these dependencies themselves may have common, but different version, assemblies.</p>
<p>I&#8217;ve certainly had this problem many times: you have a solution that uses other third party assemblies all of which have common dependencies (such as log4net, perhaps), but different assemblies rely on different versions of this common item. </p>
<p>Refix helps you tackle this problem in two ways:</p>
<p>Firstly, it goes through your solution to work out if there is actually a common set of dependencies that can be used. If so it can re-write your project files accordingly.</p>
<p>If there isn&#8217;t, you can supply it with a list of compatible versions (e.g. tell it v 1.1.1.0 is compatible with v 1.1.2.0), and it can automatically insert the correct assembly redirects for you into the applicaiton configuration files.</p>
<p>It is looking to be an excellent tool so far. It is in the early stages, but completely useable and it tackles a problem that I&#8217;ve not seen solved in .Net before.</p>
<p>Check it out:</p>
<p><a href="http://refix.codeplex.com">http://refix.codeplex.com/</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.roblevine.co.uk/?feed=rss2&amp;p=56</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Backing up VMs in VMWare ESXi 3.5</title>
		<link>http://blog.roblevine.co.uk/?p=41</link>
		<comments>http://blog.roblevine.co.uk/?p=41#comments</comments>
		<pubDate>Fri, 30 Apr 2010 19:31:22 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Backups]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[VMWare]]></category>

		<guid isPermaLink="false">http://blog.roblevine.co.uk/?p=41</guid>
		<description><![CDATA[I&#8217;m a massive fan of VMWare products. I use the excellent VMWare workstation for most of my development on the desktop, and I use VMWare ESXi (3.5) for hosting my virtual servers.
VMWare ESXi is a truly fantastic product, and it is free, but with this free version you are limited in the management toolset available. [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m a massive fan of VMWare products. I use the excellent VMWare workstation for most of my development on the desktop, and I use VMWare ESXi (3.5) for hosting my virtual servers.</p>
<p>VMWare ESXi is a truly fantastic product, and it is free, but with this free version you are limited in the management toolset available. One thing that is missing is an easy way to back up your VMs.</p>
<p>Over time, I&#8217;ve tried a few approaches to backing up &#8211; none really good enough:</p>
<ul>
<li>Using the download option in the &quot;datastore browser&quot;. Not reliable for large files, and most of my VMs are &gt; 30GB </li>
<li>Enabling ssh in the VMWare ESXi hypervisor, and using SCP to download the vm files. Again, not reliable for large files </li>
<li>Using the “VMWare vCenter Converter Standalone” to copy/convert my VMs to local copies. This is often slow though, and I don&#8217;t really understand what aspect of my VMs are being &quot;converted&quot;, and to what. In short, I don’t really understand what is happening and if it might impact the stability or performance of any restored VMs, so I am nervous about it. </li>
<li>I&#8217;ve investigated and given up on trying to get smbfs support running in the hypervisor shell. The idea here was to allow me to mount a remote Windows share, and just copy the vm files over. However, no smbfs support built in. No installer mechanism (e.g. no <span class="code">rpm</span> and no <span class="code">apt-get</span>), no compiler and no iso9660 support that would let me at least mount a cd-rom. This hypervisor is <em>thin!</em> </li>
<li>I tried using a Ubuntu 9.04 live CD, so I could start up the ESXi host machine in Ubuntu (of course, it means ESXi VM host server would not be available during this time), but then realised there was no vmfs support in Ubuntu so I wouldn&#8217;t be able to mount the disks containing the VMs. </li>
</ul>
<p>All is not lost though; ESXi does support NFS natively (as a client), so it is relatively easy to create an NFS &quot;backup share&quot; on another server</p>
<p>One thing worthy of note &#8211; I have tried creating an NFS server on my Windows XP box (using Windows Services for UNIX 3.0) and on my Windows 2008 server (using Services for Network File System). In both cases I could not get ESXi to play nice with the share, so I switched back to using Linux to host my NFS server. Right tool, right job, and all that.</p>
<p>My setup will be this &#8211; a VM running Ubuntu (10.04) which exports an NFS share that can be used to copy the backup VM files to and from ESXi (from within the ESXi shell). Optionally (and beyond the scope of this article), this exported directory can also be shared out via Samba, to give your Windows clients access to the backed up files.</p>
<p>These notes are a list of how I did this specifically. Your mileage may vary:</p>
<h4>Initial setup of your Ubuntu environment</h4>
<ul>
<li>Install Ubuntu (10.04) to a VM, a physical box, or somewhere. </li>
<li>Make sure you have plenty of disk space to host the backups on, and create your target shared directory. For this example I&#8217;ll be using /backups </li>
<li>Set the permissions accordingly on this directory </li>
</ul>
<h4>Installing NFS and exporting your share</h4>
<ul>
<li>Install your NFS server:
<div style="padding-bottom: 0px; margin: 0px; padding-left: 0px; padding-right: 0px; display: inline; float: none; padding-top: 0px" id="scid:f32c3428-b7e9-4f15-a8ea-c502c7ff2e88:3596613e-171b-473d-afbd-34431cd34df6" class="wlWriterEditableSmartContent">
<pre class="brush: bash;">sudo apt-get install portmap nfs-kernel-server</pre>
</div>
</li>
<li>Export your shared directory by editing /etc/exports and adding the following (noting that 192.168.0.1 should be replaced by the ip address of your ESXi server):
<div style="padding-bottom: 0px; margin: 0px; padding-left: 0px; padding-right: 0px; display: inline; float: none; padding-top: 0px" id="scid:f32c3428-b7e9-4f15-a8ea-c502c7ff2e88:2a5019df-5fee-43f8-b000-44c1d4e895c5" class="wlWriterEditableSmartContent">
<pre class="brush: bash;">/backups 192.168.0.1(rw,async,no_subtree_check)</pre>
</div>
</li>
<li>Issue the command
<div style="padding-bottom: 0px; margin: 0px; padding-left: 0px; padding-right: 0px; display: inline; float: none; padding-top: 0px" id="scid:f32c3428-b7e9-4f15-a8ea-c502c7ff2e88:2a0622ed-c846-43d7-841d-f7bca073b6a8" class="wlWriterEditableSmartContent">
<pre class="brush: bash;">sudo exportfs -ra</pre>
</div>
</li>
<li>Start/Restart the NFS service by issuing the command:
<div style="padding-bottom: 0px; margin: 0px; padding-left: 0px; padding-right: 0px; display: inline; float: none; padding-top: 0px" id="scid:f32c3428-b7e9-4f15-a8ea-c502c7ff2e88:853fe696-3fbe-493b-b8d7-28db08f23ce4" class="wlWriterEditableSmartContent">
<pre class="brush: text;">sudo /etc/init.d/nfs-kernel-server restart</pre>
</div>
</li>
<li>You should now have an exported NFS share! </li>
</ul>
<h4>Mounting your NFS share from ESXi</h4>
<ul>
<li>Start up the VMWare Infrastructure client and log into your ESXi server </li>
<li>Select Configuration-&gt;Storage and choose &quot;Add Storage&quot; </li>
<li>Choose &quot;Network File System&quot; </li>
<li>For the &quot;server&quot;, enter the IP address of your Ubuntu server </li>
<li>For the &quot;folder&quot;, enter the full path to the exported folder: /backups </li>
<li>For the &quot;datastore name&quot;, enter a meaningful name, e.g. nfsshare </li>
<li>You should now be able to see the remote NFS share from the ESXi host. </li>
</ul>
<h4>Backing up your VM</h4>
<p>I prefer to perform this step from the command line rather than using the VMWare Infrastructure Client. Again, your mileage may vary. </p>
<ul>
<li>Log onto your VM console by hitting &lt;ctrl&gt;-&lt;alt&gt;-F1 at the VMWare splash screen, typing &quot;unsupported&quot; (without the quotes) and hitting return, and then entering your password. </li>
<li>Move to the volumes directory with
<div style="padding-bottom: 0px; margin: 0px; padding-left: 0px; padding-right: 0px; display: inline; float: none; padding-top: 0px" id="scid:f32c3428-b7e9-4f15-a8ea-c502c7ff2e88:3fd2c4e8-3cac-408b-ad25-cbdba914d43e" class="wlWriterEditableSmartContent">
<pre class="brush: bash;">cd /vmfs/volumes</pre>
</div>
</li>
<li>Note that this directory contains subdirectories of the physical drives, and the NFS share </li>
<li>Ensuring the power is off on the VM you are about to back up, copy the VMs directory to the NFS share. For example:
<div style="padding-bottom: 0px; margin: 0px; padding-left: 0px; padding-right: 0px; display: inline; float: none; padding-top: 0px" id="scid:f32c3428-b7e9-4f15-a8ea-c502c7ff2e88:7ba836a8-7fab-4075-97ea-12caf920c6d5" class="wlWriterEditableSmartContent">
<pre class="brush: text;">cp Disk2/MyVM nfsshare -R</pre>
</div>
</li>
<li>Check the folder has been copied to the target share, et voila &#8211; you have a backup of your VM </li>
</ul>
<h4>Restoring your VM</h4>
<p>Again, I prefer to perform this step from the command line rather than using the VMWare Infrastructure Client. This is the same approach as backing up, except that we copy the files from the NFS share to the ESXi server. In other words, the last step becomes something like: </p>
<div style="padding-bottom: 0px; margin: 0px; padding-left: 0px; padding-right: 0px; display: inline; float: none; padding-top: 0px" id="scid:f32c3428-b7e9-4f15-a8ea-c502c7ff2e88:2b7d5098-2448-4f2c-a871-38c76ee90818" class="wlWriterEditableSmartContent">
<pre class="brush: bash;">cp Disk2/MyVM nfsshare -R</pre>
</div>
<p>Finally, you may need to re-import the restored VM into the ESXi host&#8217;s inventory. To do this, navigate to the VMs files in the VMWare Infrastructure Client, right click on the VMs .vmx file, and click &quot;Add To Inventory&quot; </p>
]]></content:encoded>
			<wfw:commentRss>http://blog.roblevine.co.uk/?feed=rss2&amp;p=41</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Imaging a hard disk over the network</title>
		<link>http://blog.roblevine.co.uk/?p=26</link>
		<comments>http://blog.roblevine.co.uk/?p=26#comments</comments>
		<pubDate>Mon, 19 Apr 2010 21:47:33 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Backups]]></category>
		<category><![CDATA[Linux]]></category>

		<guid isPermaLink="false">http://blog.roblevine.co.uk/?p=26</guid>
		<description><![CDATA[Yesterday I found myself in need of a way to take a hard disk image over the network; that is to image the entire hard disk to a remote Windows share.
It&#8217;s not a particularly complex operation, but it is nevertheless useful, so I thought I&#8217;d jot down the steps here.
In my case, the machine is [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday I found myself in need of a way to take a hard disk image over the network; that is to image the entire hard disk to a remote Windows share.</p>
<p>It&#8217;s not a particularly complex operation, but it is nevertheless useful, so I thought I&#8217;d jot down the steps here.</p>
<p>In my case, the machine is a Windows Server (2008), which I am about to rebuild. However, I wanted to be able to image the drive so I could roll back to its current state the new install didn’t go so well.</p>
<p>Bear in mind, my aim is to take a snapshot of the drive, a “saved position”, in its current state so I can restore it. I am not attempting to take a backup with a view to being able to retrieve data from it later. As such I am taking an image of the whole drive, rather than creating mountable copies of the individual partitions or backing up files.</p>
<p>Anyway – to perform this imaging, I downloaded the latest Ubuntu as a &quot;Live CD&quot; with the aim of using this to snapshot the drive to a remote Windows share.</p>
<p>Here are the steps I went through:</p>
<ol>
<li>Create a share on the target machine (another Windows box) and set share and NTFS permissions to grant read/write permissions a user on the windows box.
</li>
<li>Boot the computer containing the drive to be backed up into Ubuntu
</li>
<li>Optionally zero the free space on the drive (see below for more information)
</li>
<li>Install smbfs; this allows you to mount windows shares:
<div style="padding-bottom: 0px; margin: 0px; padding-left: 0px; padding-right: 0px; display: inline; float: none; padding-top: 0px" id="scid:f32c3428-b7e9-4f15-a8ea-c502c7ff2e88:19754e58-ac05-492a-9639-0c150cd0eb5d" class="wlWriterEditableSmartContent">
<pre class="brush: bash;">sudo apt-get install smbfs </pre>
</div>
</li>
<li>Create a mount point for the remote Windows share:
<div style="padding-bottom: 0px; margin: 0px; padding-left: 0px; padding-right: 0px; display: inline; float: none; padding-top: 0px" id="scid:f32c3428-b7e9-4f15-a8ea-c502c7ff2e88:956b07b5-cb32-4812-85f5-c061376b7932" class="wlWriterEditableSmartContent">
<pre class="brush: bash;">cd /mnt
sudo mkdir remoteshare</pre>
</div>
</li>
<li>Mount your remote Windows share:&#160;
<div style="padding-bottom: 0px; margin: 0px; padding-left: 0px; padding-right: 0px; display: inline; float: none; padding-top: 0px" id="scid:f32c3428-b7e9-4f15-a8ea-c502c7ff2e88:d8298808-66ed-4609-aeb3-7cee3552b762" class="wlWriterEditableSmartContent">
<pre class="brush: bash;">sudo smbmount //&lt;remoteip&gt;/&lt;remoteshare&gt; /mnt/remoteshare -o username=&lt;username&gt;</pre>
</div>
<p>in my case:</p>
<div style="padding-bottom: 0px; margin: 0px; padding-left: 0px; padding-right: 0px; display: inline; float: none; padding-top: 0px" id="scid:f32c3428-b7e9-4f15-a8ea-c502c7ff2e88:76f0178b-5338-4538-be4c-92123640bcaa" class="wlWriterEditableSmartContent">
<pre class="brush: bash;">sudo smbmount //192.168.1.1/backup /mnt/remoteshare -o username="robert levine"</pre>
</div>
</li>
<li>Finally, perform the disk image itself:
<div style="padding-bottom: 0px; margin: 0px; padding-left: 0px; padding-right: 0px; display: inline; float: none; padding-top: 0px" id="scid:f32c3428-b7e9-4f15-a8ea-c502c7ff2e88:9b544a55-d29e-47ba-82c0-cee8fb659d1c" class="wlWriterEditableSmartContent">
<pre class="brush: bash;">sudo dd if=/dev/sda | gzip &gt; /mnt/remoteshare/backup.dd.gz</pre>
</div>
<p>[Note: the drive I am backing up is /dev/sda. Your drive may not be – <strong>YOU MUST USE THE CORRECT DRIVE DEVICE NAME OR YOU WILL BACKUP THE WRONG DISK!</strong>] </p>
</li>
</ol>
<h3>Restoring the backup</h3>
<p>To restore the backup, you follow the same steps of booting the target into Ubuntu, installing “smbfs”, creating the mount point for the remote share, and mounting the remote share.</p>
<p>To actually perform the restore, issue the following command:</p>
<div style="padding-bottom: 0px; margin: 0px; padding-left: 0px; padding-right: 0px; display: inline; float: none; padding-top: 0px" id="scid:f32c3428-b7e9-4f15-a8ea-c502c7ff2e88:0b6b365f-ac1b-46b4-b169-1d7d1b1d5e14" class="wlWriterEditableSmartContent">
<pre class="brush: bash;">gzip -dc /mnt/remoteshare/backup.dd.gz | dd of=/dev/sda</pre>
</div>
<p>As with the backup – I am restoring to the drive /dev/sda. Your drive may not be – <strong>YOU MUST USE THE CORRECT DRIVE DEVICE NAME OR YOU WILL RESTORE OVER WRONG DISK AND LOSE YOUR DATA!</strong></p>
<p>Well – it worked for me!</p>
<h3>Zeroing the free space on the drive</h3>
<p>You may wish to zero the free space on the drive. The backup operation detailed above uses gzip to compress the drive image. However, if your drive has a lot of free space that previously contained files, then this space will actually contain the detritus of those files and may not be very compressible.</p>
<p>In my case it contained many tens of gigs of mp3s that had been deleted. Writing a single file of zeros will clean this and make the free space compressible.</p>
<p>However, it may also take a bit of time <img src='http://blog.roblevine.co.uk/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>Here are the steps &#8211; you may want to repeat this for each partition on the drive. In my case, only one partition contained &quot;deleted free space&quot;: /dev/sda2, so this is the only one I zeroed</p>
<ol>
<li>Create a mount point for the source drive in Ubuntu:
<div style="padding-bottom: 0px; margin: 0px; padding-left: 0px; padding-right: 0px; display: inline; float: none; padding-top: 0px" id="scid:f32c3428-b7e9-4f15-a8ea-c502c7ff2e88:53527359-f13a-4c07-8342-571bd4f7547d" class="wlWriterEditableSmartContent">
<pre class="brush: bash;">cd /mnt
sudo mkdir srcdrive</pre>
</div>
</li>
<li>Mount the source drive:
<div style="padding-bottom: 0px; margin: 0px; padding-left: 0px; padding-right: 0px; display: inline; float: none; padding-top: 0px" id="scid:f32c3428-b7e9-4f15-a8ea-c502c7ff2e88:d7c21918-a591-4839-acb7-bd9a7c2e3533" class="wlWriterEditableSmartContent">
<pre class="brush: bash;">sudo mount /dev/X /mnt/srcdrive</pre>
</div>
<p>where X is the device name and partition of the drive (in my case &quot;sda2&quot;). If you are not sure what the device name of the drive is, try listing all the drives/partitions on the system with:</p>
<div style="padding-bottom: 0px; margin: 0px; padding-left: 0px; padding-right: 0px; display: inline; float: none; padding-top: 0px" id="scid:f32c3428-b7e9-4f15-a8ea-c502c7ff2e88:db5a5fd7-910e-44a3-a8bb-4734467e37d2" class="wlWriterEditableSmartContent">
<pre class="brush: bash;">sudo fdisk -l</pre>
</div>
<p>This should help you to identify the drive/partition</p>
</li>
<li>Once you have mounted the drive, write your zero file (this may take some time):
<div style="padding-bottom: 0px; margin: 0px; padding-left: 0px; padding-right: 0px; display: inline; float: none; padding-top: 0px" id="scid:f32c3428-b7e9-4f15-a8ea-c502c7ff2e88:c5adf55e-e931-4314-b3c4-fb78b6a52493" class="wlWriterEditableSmartContent">
<pre class="brush: bash;">cat /dev/zero &gt; /mnt/sourcedrive/zero.dat</pre>
</div>
<p>This may take some time!</p>
</li>
<li>Once done, delete the file, and then unmount the drive:
<div style="padding-bottom: 0px; margin: 0px; padding-left: 0px; padding-right: 0px; display: inline; float: none; padding-top: 0px" id="scid:f32c3428-b7e9-4f15-a8ea-c502c7ff2e88:875dc219-f866-4ac2-bceb-97ac9619d1ba" class="wlWriterEditableSmartContent">
<pre class="brush: bash;">sudo rm /mnt/srcdrive/zero.dat sudo umount /mnt/srcdrive</pre>
</div>
</li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://blog.roblevine.co.uk/?feed=rss2&amp;p=26</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Vista install disc and the .wim files</title>
		<link>http://blog.roblevine.co.uk/?p=22</link>
		<comments>http://blog.roblevine.co.uk/?p=22#comments</comments>
		<pubDate>Thu, 17 Apr 2008 08:38:02 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Rant]]></category>
		<category><![CDATA[Windows]]></category>

		<guid isPermaLink="false">http://blog.roblevine.co.uk/?p=22</guid>
		<description><![CDATA[Yesterday I needed to reinstall a Windows font that seemed to be misbehvaing. Over the years I have built up a fair bit of knowledge about Windows, so I thought I knew what to do:

Delete the font from \Windows\fonts
Locate the i386 directory on the install media for Windows (or my local drive if I ever [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday I needed to reinstall a Windows font that seemed to be misbehvaing. Over the years I have built up a fair bit of knowledge about Windows, so I thought I knew what to do:</p>
<ol>
<li>Delete the font from <span class="code">\Windows\fonts</span></li>
<li>Locate the <span class="code">i386</span> directory on the install media for Windows (or my local drive if I ever copied it over)</li>
<li>Find the compressed font in question; e.g. for Wingdings, the compressed file would be <span class="code"><strong>WINGDINGS.TT_</strong></span></li>
<li>Use the command line <span class="code">expand.exe</span> tool to expand this to <span class="code"><strong>WINGDINGS.TTF</strong></span></li>
<li>Install the font by copying it to the <span class="code">\Windows\fonts</span> directory.</li>
</ol>
<p>If I remember correctly, the process has been the same for a long time. Back in the Windows 9x days there was no i386 (this was of Windows NT provenance), so you&#8217;d have to look in the cabinet (<span class="code">.cab</span>) files. Before that, in the Win3.x/DOS days, you&#8217;d have to search the install media for any files you wished to reinstall. But overall, the process has been with us for some time, and the repository of files that is the <span class="code">i386</span> folder was the place to start.</p>
<p>The problem was that my Vista install did not contain a <span class="code">i386</span> directory. As a perused the directory structure of the DVD I discovered a directory named <span class="code">sources</span> which contained a fair few files, but not nearly enough for a Windows install, and it didn&#8217;t contain the file I was looking for. In fact, the entire DVD didn&#8217;t seem to contain the file I was looking for. In fact, now I&#8217;d come to ask the question: &#8220;WHERE ARE ALL THE BL**DY INSTALL FILES?&#8221;.</p>
<p>And then I noticed <span class="code">install.wim</span>, all 2,412,507,182 bytes of it (that is 2.24GB folks). </p>
<p>It is bound to be in there. It must be like the <span class="code">.cab</span> file of yesteryear.</p>
<p>So I double-clicked it; no default viewer for this file type.</p>
<p><a href="http://www.winzip.com">WinZip</a> &#8211; &#8220;Not a valid archive&#8221;.</p>
<p><a href="http://www.ezbsystems.com/ultraiso/">UltraISO</a> (great utility) &#8211; &#8220;Invalid or unknown image file format&#8221;.</p>
<p>And then I thought &#8220;hang on &#8211; there must be a userspace tool that ships with Windows for this file format, even if it is command line&#8221;; after all, who&#8217;d ship an O.S. in a format that you can&#8217;t access unless you were the installer program? Clearly that would be silly.</p>
<p>So I searched my Windows installation, and I searched the install disc, and then I searched the web for info. </p>
<p>No.</p>
<p>Either I missed something obvious, or there is no quick way to get at the files contained in a <span class="code">.wim</span> image file.</p>
<p>From all my searching, apart from the odd shareware utility that <em>might</em> have been able to do it, I found one official answer; the <span class="code">IMAGEX.EXE</span> utility. A Microsoft tool that lets you mount a <span class="code">.wim</span> file as a drive.</p>
<p>And where was this tool?</p>
<p>Available as a free download in the <a href="http://www.microsoft.com/downloads/details.aspx?FamilyID=C7D4BC6D-15F3-4284-9123-679830D629F2&amp;displaylang=en">Windows Automated Install Kit</a>.</p>
<p>And how big is this download?</p>
<p>Glad you asked.</p>
<p>It is 992.2MB.</p>
<p>3 observations to make here:</p>
<ul>
<li>If I am right and there really isn&#8217;t a userspace tool to read <span class="code">.wim</span> files that comes with Windows then that is <b>absolutely pathetic</b>.</li>
<li>I&#8217;m glad I got my broadband speed doubled last week.</li>
<li>Naturally the font I resinstalled didn&#8217;t fix the problem, so I still can&#8217;t print Franklin Gothic Book type in bold to my HP Photosmart 3310 printer.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://blog.roblevine.co.uk/?feed=rss2&amp;p=22</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Implementing a simple hashing algorithm (pt II)</title>
		<link>http://blog.roblevine.co.uk/?p=19</link>
		<comments>http://blog.roblevine.co.uk/?p=19#comments</comments>
		<pubDate>Thu, 03 Apr 2008 15:27:15 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[.Net]]></category>
		<category><![CDATA[Gotchas]]></category>
		<category><![CDATA[Hash codes]]></category>

		<guid isPermaLink="false">http://blog.roblevine.co.uk/?p=19</guid>
		<description><![CDATA[In my last blog article I looked at implementing a hashing algorithm by trying different boolean mathematical operations on the constituent fields of our class. It was very clear that out of AND, OR and XOR, only XOR provided us with anything like a balanced hash code. However, although it worked well in the previous [...]]]></description>
			<content:encoded><![CDATA[<p>In my <a href="http://blog.roblevine.co.uk/?p=15">last blog article</a> I looked at implementing a hashing algorithm by trying different boolean mathematical operations on the constituent fields of our class. It was very clear that out of AND, OR and XOR, only XOR provided us with anything like a balanced hash code. However, although it worked well in the previous example (a music library), all is not quite as it seems. On closer inspection of the behaviour of the XOR hash, it turns out that this hashing algorithm has its own flaws and is not ideal in most situations.</p>
<p>&#160;</p>
<h3><b>The Commutativity Issue</b></h3>
<p>The most obvious problem with the &quot;XOR&quot; all fields approach to hash codes is that any two fields XOR&#8217;d together will give you the same value, regardless of the order in which they are XOR&#8217;d (i.e. XOR is a commutative operation). This increases the chance that you will get hash code collision, which of course is a <em>bad</em> thing. Consider the following example (all fields contributing to hash):</p>
<table>
<thead>
<tr>
<th>Forename</th>
<th>Middle name</th>
<th>Surname</th>
</tr>
</thead>
<tbody>
<tr>
<td>John</td>
<td>Paul</td>
<td>Jones</td>
</tr>
<tr>
<td>Paul</td>
<td>John</td>
<td>Jones</td>
</tr>
</tbody>
</table>
<p>Clearly the two people above are not the same person, but they will have the same hashcode if we generate our hash with a simple XOR of the hash codes of the constituent fields.</p>
<p>As mentioned in previous posts, a hashcode is <em>not</em> a unique identifier, and the fact both people have the same hash-code won&#8217;t <em>break</em> a hashtable. However, it will lead to a less efficient hashtable, as both our people will end up in the same bucket of the hashtable when, ideally, they shouldn&#8217;t.</p>
<p>&#160;</p>
<h3><b>The Range Issue</b></h3>
<p>Another problem that the XOR approach doesn&#8217;t address is that of the potential range of values of a hashcode. In my previous post I showed that the XOR implementation of hash code seemed acceptable for my music library example. However, in reality I was relying on the properties of the individual hash codes that made up my overall hash code. Specifically, many of the fields that contributed to the hash were strings, and the BCL implementation of <span class="code">System.String.GetHashCode()</span> has a pretty good distribution. Had my music track entity not contained several string fields, things would have looked very different.</p>
<p>But what about fields that may have a poor &quot;innate&quot; distribution? Given that <span class="code">System.Integer.GetHashCode()</span> returns the integer itself, what happens if I have an field representing a person&#8217;s age? The spread of values is, at best, 0 &lt; age &lt; 120, which is hardly the distribution across integer-space that you might want. A person&#8217;s height in cm? 0 &lt; height &lt; 250.</p>
<p>You see the problem? I don&#8217;t really want to be combining all these hash codes in the very low integer ranges using the XOR approach because it means my final &quot;cumulative&quot; hash code will be stuck within this range as well.</p>
<p>&#160;</p>
<h3><b>Examining the flaws in more detail</b></h3>
<p>I don&#8217;t pretend to be an expert on hashing algorithms, but I can see that we have issues here and so the best way of discovering more (for me, at least) is to work through a broken example and see what I can do to fix it.</p>
<p>I very quickly settled on the idea of trying to write a replacement <span class="code">String.GetHashCode()</span> implementation. I would get a dictionary of words (all in lower case), and try and write a hash code implementation based on the ASCII code of each letter in the word. Given that the the hashcode of our ASCII codes would be the ASCII code itself (since the hashcode of an integer is the integer itself), we would have a poor distribution of individual character hash codes; all in the range 97 (a) &#8211; 122 (z). This approach would also highlight problems such as commutativity (since many words have the same letters, but just in a different order).</p>
<p>My XOR implementation for this approach would look like this:</p>
<pre class="csharpcode"><span class="kwrd">public</span> <span class="kwrd">int</span> GetHashCode(<span class="kwrd">string</span> word)
{
    <span class="kwrd">char</span>[] chars = word.ToCharArray();
    <span class="kwrd">int</span> hash = 0;
    <span class="kwrd">foreach</span> (<span class="kwrd">char</span> c <span class="kwrd">in</span> chars)
    {
        <span class="kwrd">int</span> ascii = (<span class="kwrd">int</span>)c;
        hash ^= ascii.GetHashCode();
    }
    <span class="kwrd">return</span> hash;
}</pre>
<p>I sourced my dictionary from <a href="http://www.dicts.info/uddl.php">here</a>, and de-duplicated it (and converted the words to lower case) to produce <a href="http://blog.roblevine.co.uk/wp-content/uploads/2008/04/EnglishDict.txt">this</a> list of words.</p>
<p>As expected, this algorithm (referred to as <strong>AsciiChar_XOR</strong> in the diagram) has a spectacularly bad value distribution, as shown in this histogram:</p>
<p><img src="http://blog.roblevine.co.uk/wp-content/uploads/2008/04/hashCodeHistogram_ASCIIXOR.PNG" alt="Histogram for ASCIIXOR."/> </p>
<p>[Note that the width of each bucket here is 67108864, being 2^32 / 64]</p>
<p>Surprise, surprise &#8211; all 2898 words fall into the same histogram bucket! In fact they all fall into the far narrower range of 0-127 &#8211; the ASCII code range.</p>
<p>All of a sudden the simple XOR approach to hashing looks like a very poor performer indeed.</p>
<p>&#160;</p>
<h3><b>Examining better algorithms</b></h3>
<p>Since we&#8217;ve already discussed the weaknesses of XOR, we should have a fairly good idea of where to focus our attention to create better algorithms. Firstly we should be choosing an algorithm that is non-commutative, and secondly we should be choosing an algorithm that uses the full range of integer-space, rather than limiting a hashcode to the range of its constituent members.</p>
<p>A bit of a search around reveals two approaches that are often discussed. The first one, given to me as a boiler-plate example by a Java developer friend, looks like this:</p>
<pre class="csharpcode"><span class="kwrd">public</span> <span class="kwrd">override</span> <span class="kwrd">int</span> GetHashCode()
{
    <span class="kwrd">int</span> hash = 23;
    hash = hash * 37 + (field1 == <span class="kwrd">null</span> ? 0 : field1.GetHashCode());
    hash = hash * 37 + (field2 == <span class="kwrd">null</span> ? 0 : field2.GetHashCode());
    hash = hash * 37 + (field3 == <span class="kwrd">null</span> ? 0 : field3.GetHashCode());
    <span class="kwrd">return</span> hash;
} </pre>
<p>[I shall refer to this type of algorithm as <strong>JavaStyleAddition</strong> as it seems to be a very common implementation in the Java world]</p>
<p>The second common pattern looks something like this:</p>
<pre class="csharpcode"><span class="kwrd">public</span> <span class="kwrd">override</span> <span class="kwrd">int</span> GetHashCode()
{
    <span class="kwrd">int</span> hash = 23;
    hash = (hash &lt;&lt; 5) + (field1 == <span class="kwrd">null</span> ? 0 : field1.GetHashCode());
    hash = (hash &lt;&lt; 5) + (field2 == <span class="kwrd">null</span> ? 0 : field2.GetHashCode());
    hash = (hash &lt;&lt; 5) + (field3 == <span class="kwrd">null</span> ? 0 : field3.GetHashCode());
    <span class="kwrd">return</span> hash;
} </pre>
<p>[I shall refer to this type of algorithm as <strong>ShiftAdd</strong>]</p>
<p>Both of these have some similar characteristics. By taking the cumulative hash so far, applying an operation (multiply for JavaStyleAddition and left-shift-5 for ShiftAdd) and then adding the new field&#8217;s hash code, they both avoid the commutativity issue since</p>
<p><span class="code">(x * n) + y</span> is not <em>generally</em> equal to <span class="code">( y * n ) + x</span>.</p>
<p>They also both increase the range of the cumulative hash code above that of the constituent hash codes due to the effect of multiplying the cumulative hash code each time (remember that a single left shift is a multiplication by two).</p>
<p>You will also notice that both approaches use a prime number (23 in the examples shown) as the starting value for the hash, and JavaStyleAddition uses another prime (37) as the multiplier. My guess is that this, statistically, makes collisions less likely as you multiply up your hash code because if one side of the multiplication has no factors (other than 1 and itself), then you are lowering the statistical average number of factors of the result. Of course, I may be wrong about that :-s</p>
<p>A variant of ShiftAdd that I have seeing during my Google journeys is one in which the hash codes are shifted and then XOR&#8217;d (rather than added). I shall refer to this as <strong>ShiftXOR</strong>. </p>
<p><img src="http://blog.roblevine.co.uk/wp-content/uploads/2008/04/hashCodeHistogram_Compare_SX_JSA_SA.PNG" alt="Histogram for SX_JSA_SA."/> </p>
<p>This certainly looks better than the histogram for AsciiChar_XOR <img src='http://blog.roblevine.co.uk/wp-includes/images/smilies/icon_biggrin.gif' alt=':-D' class='wp-smiley' /> </p>
<p>However, all three algorithms still cluster around the centre of the number range, and all exhibit other major spikes in distribution.</p>
<p>What I could have done at this point is break into a major mathematical and statistical analysis of these three hashing algorithms, but I decided against it for two key reasons. Firstly &#8211; I would have got bored, and secondly &#8211; I wouldn&#8217;t have had the first idea where to start!</p>
<p>Nope &#8211; I felt it better to fall back on my hacker instincts and munge various forms of the above algorithms to see if I could produce a better algorithm for my particular use case.</p>
<p>I quickly came up with two further algorithms, both combining the left-shift approach with the the prime-add approach of the above algorithms. The difference between them being only that one adds the hashcodes each time, while the other XORs them:</p>
<p>&#160;</p>
<pre class="csharpcode"><span class="kwrd">public</span> <span class="kwrd">override</span> <span class="kwrd">int</span> GetHashCode()
{
    <span class="kwrd">int</span> hash = 23;
    hash = ((hash &lt;&lt; 5) * 37 ) + (field1 == <span class="kwrd">null</span> ? 0 : field1.GetHashCode());
    hash = ((hash &lt;&lt; 5) * 37 ) + (field2 == <span class="kwrd">null</span> ? 0 : field2.GetHashCode());
    hash = ((hash &lt;&lt; 5) * 37 ) + (field3 == <span class="kwrd">null</span> ? 0 : field3.GetHashCode());
    <span class="kwrd">return</span> hash;
} </pre>
<p>[<strong>ShiftPrimeAdd]</strong></p>
<p>and</p>
<p>&#160;</p>
<pre class="csharpcode"><span class="kwrd">public</span> <span class="kwrd">override</span> <span class="kwrd">int</span> GetHashCode()
{
    <span class="kwrd">int</span> hash = 23;
    hash = ((hash &lt;&lt; 5) * 37) ^ (field1 == <span class="kwrd">null</span> ? 0 : field1.GetHashCode());
    hash = ((hash &lt;&lt; 5) * 37) ^ (field2 == <span class="kwrd">null</span> ? 0 : field2.GetHashCode());
    hash = ((hash &lt;&lt; 5) * 37) ^ (field3 == <span class="kwrd">null</span> ? 0 : field3.GetHashCode());
    <span class="kwrd">return</span> hash;
} </pre>
<p>[<strong>ShiftPrimeXOR]</strong></p>
<p>For good measure, I threw these into the mix alongside the standard BCL implementation of <span class="code">System.String.GetHashCode()</span> [labelled as <strong>StringGetHashCode</strong> in the diagram] to see how they would fare:</p>
<p><img src="http://blog.roblevine.co.uk/wp-content/uploads/2008/04/hashCodeHistogram_Compare_SPX_SGHC_SPA.PNG" alt="Histogram for SPX_SGHC_SPA."/> </p>
<p>[Note that the y-axis for this histogram is half that of the previous diagram]</p>
<p>Now &#8211; that is MUCH more like it. We have a well balanced hash code distribution across the entire range of integer space. There are a few minor spikes, but all three seem to compare favourably with each other. The approach of including the prime and &#8216;multiplying&#8217; up each time really does seem to do the trick.</p>
<p>Which would I chose out of ShiftPrimeXOR and ShiftPrimeAdd? Not sure &#8211; I&#8217;d have to benchmark them first and see which was fastest!</p>
<p>&#160;</p>
<h3><b>Conclusion</b></h3>
<p>In summary, just XORing fields together may well produce an <em></em>awful<em></em> hash code distribution, unless the constituent field hash codes are themselves well balanced.</p>
<p>However, during the course of writing this article, I have realised that there are other relatively simple implementations that provide a good hash code distribution (for this example at least). More than that, I have reinforced my belief that these things are best checked out if you have any doubts. It doesn&#8217;t take long to put together a test harness and profile your algorithms with a sample of your data.</p>
<p>One thing I have omitted from my investigation is any discussion on the speed of the algorithms. It would be worth benchmarking each one because if a class is being used in a hashtable, its <span class="code">.GetHashCode()</span> method is called for each <span class="code">.Add</span>, <span class="code">.Remove</span> and <span class="code">.Contains</span> call. However, the following thought does occur to me; with these sort of repetitve mathematical operations, the relative speed of XOR vs. shift vs. multiply (etc) may <em>well</em> actually depend on your CPU architecture.</p>
<p>On reflection, there is a lot more to hashing than the small amount I know and I&#8217;m sure many mathematical research papers have been written on the subject.</p>
<p>In the future, my default choice will probably lean towards ShiftPrimeXOR or ShiftPrimeAdd as a starting point. It would be a waste of time to spend days up front trying to work out the perfect hashing algorithm. My approach would be to choose one, use it, and keep an eye on its performance. If it proves to problematic then consider optimising it, otherwise leave it alone.</p>
<p>Right &#8211; enough already about hashing algorithms!</p>
<p>A Visual Studio 2008 project containg the console application I used to generate these results can be found <a href="http://blog.roblevine.co.uk/wp-content/uploads/2008/04/StringHashingAlgorithms.zip">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.roblevine.co.uk/?feed=rss2&amp;p=19</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Implementing a simple hashing algorithm</title>
		<link>http://blog.roblevine.co.uk/?p=15</link>
		<comments>http://blog.roblevine.co.uk/?p=15#comments</comments>
		<pubDate>Fri, 14 Mar 2008 16:15:41 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[.Net]]></category>
		<category><![CDATA[Gotchas]]></category>
		<category><![CDATA[Hash codes]]></category>
		<category><![CDATA[Rant]]></category>
		<category><![CDATA[Windows]]></category>
		<category><![CDATA[XML]]></category>

		<guid isPermaLink="false">http://blog.roblevine.co.uk/?p=15</guid>
		<description><![CDATA[In my last blog article I outlined a gotcha whereby a developer overrides .Equals() without providing a similarly meaningful override to .GetHashCode(). I gave a description, and illustration, of why it is so important to ensure that the same fields are considered for the two methods.
In this article I discuss a simple strategy for providing [...]]]></description>
			<content:encoded><![CDATA[<p>In my <a href="http://blog.roblevine.co.uk/?p=13">last blog article</a> I outlined a gotcha whereby a developer overrides <span class="code">.Equals()</span> without providing a similarly meaningful override to <span class="code">.GetHashCode()</span>. I gave a description, and illustration, of why it is so important to ensure that the same fields are considered for the two methods.</p>
<p>In this article I discuss a simple strategy for providing a <span class="code">.GetHashCode()</span> implementation and compare it to some flawed equivalents. As with the previous blog article, my primary reason for writing about this is that I&#8217;ve seen some rather poor implementations of this, but it isn&#8217;t really complicated or difficult to come up with a reasonable solution.</p>
<p>Referring back to the same three requirements discussed previously in the <a href="http://msdn2.microsoft.com/en-us/library/system.object.gethashcode.aspx">MSDN object.GetHashCode documentation</a>, the point that is most pertinent to this article is point three (since we&#8217;ve already discussed points one and two):</p>
<p><font face="Times New Roman">A hash function must have the following properties: </font></p>
<ul>
<li>
<p><font face="Times New Roman">If two objects compare as equal, the GetHashCode method for each object must return the same value. However, if two objects do not compare as equal, the GetHashCode methods for the two object do not have to return different values.</font></p>
</li>
<li>
<p><font face="Times New Roman">The GetHashCode method for an object must consistently return the same hash code as long as there is no modification to the object state that determines the return value of the object&#8217;s </font><a href="http://msdn2.microsoft.com/en-us/library/d5171fyh.aspx"><font face="Times New Roman">Equals</font></a><font face="Times New Roman"> method. Note that this is true only for the current execution of an application, and that a different hash code can be returned if the application is run again.</font></p>
</li>
<li>
<p><font face="Times New Roman"><strong>For the best performance, a hash function must generate a random distribution for all input.</strong></font></p>
</li>
</ul>
<p>Points one and two above are <strong>absolute rules</strong> regarding the behaviour of .GetHashCode(). If you break either of these, then consumers of your hashcode (e.g. a hashtable) will not function correctly. As discussed in the previous article, a broken implementation may cause a hashtable to forever lose instances of your key placed within it and then report that it does not contain the key when it does. However, long as points one and two are obeyed, then your hash code (and hence hashtables) will work.</p>
<p>Point three is a recommendation. If you fail to adhere to this point, a consumer of your hash code will still function, it just won&#8217;t function very well.</p>
<p>&nbsp;</p>
<h3><b>A functional (but very bad) hashing algorithm</b></h3>
<p>Consider the following implementation:</p>
<pre class="csharpcode">
<span class="kwrd">public</span> <span class="kwrd">override</span> <span class="kwrd">int</span> GetHashCode()
{
    <span class="kwrd">return</span> 1;
}
</pre>
<p>This implementation does behave correctly with regard to the first two points above. If two objects are equal, then the method <em>does</em> return the same value; it just so happens that it also returns this value if they are not equal. That is fine and allowable &#8211; we are implementing a hash code, <em>not</em> a unique identifier. Point one above is satisfied. Additionally, our object will consistently return the same hash code (whether or not state has been modified). Point two above is satisfied.</p>
<p>It fails on point three though. Far from generating a random distribution for all input, it always returns the same number. Using this implementation will mean that all instances in a hashtable end up in the same bucket. This, consequently, means that the hashtable will have to do a <span class="code">.Equals()</span> check on the full set of data to find a match. In other words, we have just reduced our hashtable to nothing more than a simple list (e.g. and ArrayList). No better (in fact probably marginally worse) than doing a <span class="code">foreach</span> loop over the data set and manually looking for a match!</p>
<p>&nbsp;</p>
<h3><b>Other approaches to hashing</b></h3>
<p>In order to examine possible hashing algorithms more closely, I decided to take a real world set of data, and use this as a basis of further investigation. The data set that sprang to mind was my music library, probably because I was staring at <a href="http://www.winamp.com">Winamp</a> while trying to think of a suitable data source!</p>
<p>My music collection consists of a number of music tracks; for each of these there are several pieces of metadata. I exported my music library as iTunes xml and created the following interface to represent a single music track:</p>
<pre class="csharpcode"><span class="kwrd">public</span> <span class="kwrd">interface</span> IMusicTrackFile
{
    <span class="kwrd">string</span> TrackName { get; set; }
    <span class="kwrd">string</span> AlbumName { get; set; }
    <span class="kwrd">string</span> Artist { get; set; }
    <span class="kwrd">string</span> Format { get; set; }
    <span class="kwrd">int</span>? Year { get; set; }
    <span class="kwrd">int</span>? BitRate { get; set; }
    <span class="kwrd">int</span>? Size { get; set; }
    <span class="kwrd">int</span>? Time { get; set; }
    <span class="kwrd">int</span>? PlayCount { get; set; }
}</pre>
<p>The aim of this interface is to represent track information that can then be stored as the key within a hashtable. Note that I am writing the word &#8220;hashtable&#8221; in lower case, because I am thinking of the general concept of a hashtable collection; in the .Net framework this includes <span class="code">System.Collections.Hashtable</span>, as well as the generic <span class="code">System.Collections.Generic.Dictionary&lt;&gt;</span> collection, and others.</p>
<p>Given that I will consider two tracks to be equal only if all the properties on the music track instance are the same, then ideally I want my hash code to be based on all these properties. The obvious approach here is to use boolean maths to combine the hash codes from the constituent properties to give one overall hash code.</p>
<p>Three immediate choices spring to mind to combine my constituent hash codes into one overall hash code: AND, OR or XOR. It doesn&#8217;t take a rocket scientist to realise that the only choice of these three worth considering seriously is XOR.</p>
<p>Why? Because ANDing many integers together will converge on all bits being zero; the number zero itself:</p>
<p><img src="http://blog.roblevine.co.uk/images/TracksHashCodes_AND.PNG" alt="Hashcode graph for AND."/></p>
<p>Every single track here gave a hash-code of zero &#8211; that is a 100% hash code collision rate. No better, in this particular case, than our &#8220;return 1&#8243; implementation above.</p>
<p>ORing many integers together will converge on all bits being set; the number -1 in the world of two&#8217;s-compliment integer representation:</p>
<p><img src="http://blog.roblevine.co.uk/images/TracksHashCodes_OR.PNG" alt="Hashcode graph for OR."/></p>
<p>[Note: The Y-Axis range is 0 to -25 million here]</p>
<p>There is some distribution of values here, due to &#8220;flutter&#8221; in two or three bits; overall only 311 (out of 6142) unique values, and all in the negative number range. This is better than the AND scenario, but still hardly a good spread of values.</p>
<p>Only XOR gives anything like a decent spread of values:</p>
<p><img src="http://blog.roblevine.co.uk/images/TracksHashCodes_XOR.PNG" alt="Hashcode graph for XOR."/></p>
<p>[Note: The Y-Axis range here is +25 <strong>billion </strong>to -25 <strong>billion</strong>; over 2000 times larger than the OR case]</p>
<p>This is <em>much</em> more like it. 6133 unique values (out of 6142). Note that these numbers are also spread throughout the entire range of a signed integer.</p>
<p>&nbsp;</p>
<h3><b>Summary</b></h3>
<p>In the example above, XORing the hash-codes of all constituent fields that form part of an equality check produced a functional and well balanced hash-code. This may not always be the case, and I am not advocating using an &#8220;XOR everything blindly&#8221; approach, but it often does produce a very reasonable hash. It isn&#8217;t particularly difficult to output and plot hash code distributions for a sample range of your data, and see whether you have a reasonable algorithm.</p>
<p>Bear in mind, as well, that even a hashing algorithm that is <em>halfway</em> good is probably good enough to start off with. If your hashtables turn out not to be performing well enough, revisit the algorithm (rather than spend too much time indulging in <a href="http://blogs.msdn.com/ricom/archive/2006/07/26/a-nice-little-article-on-the-fallacy-of-premature-optimization.aspx">the root of all evil</a>)!</p>
<p>In case anyone is interested, my sample music library can be found <a href="http://blog.roblevine.co.uk/wp-content/uploads/2008/03/mymusic.zip">here</a>, and the Visual Studio 2005 solution I used during this analysis is <a href="http://blog.roblevine.co.uk/wp-content/uploads/2008/03/musiclibraryhashcodes.zip">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.roblevine.co.uk/?feed=rss2&amp;p=15</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>A cache key with a bad GetHashCode() is another name for a memory leak</title>
		<link>http://blog.roblevine.co.uk/?p=13</link>
		<comments>http://blog.roblevine.co.uk/?p=13#comments</comments>
		<pubDate>Fri, 07 Mar 2008 16:45:56 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[.Net]]></category>
		<category><![CDATA[Gotchas]]></category>
		<category><![CDATA[Hash codes]]></category>

		<guid isPermaLink="false">http://blog.roblevine.co.uk/?p=13</guid>
		<description><![CDATA[[With apologies to Raymond Chen for the title]
.GetHashCode() &#8211; the method exists on every single object you create; and yet I&#8217;ve come to the conclusion that a lot of developers neither know much about it, nor care about it. It is almost as though it is considered of no consequence.
Hash codes are not really very [...]]]></description>
			<content:encoded><![CDATA[<p>[With apologies to <a href="http://blogs.msdn.com/oldnewthing/archive/2006/05/02/588350.aspx">Raymond Chen</a> for the title]</p>
<p><span class="code">.GetHashCode()</span> &#8211; the method exists on every single object you create; and yet I&#8217;ve come to the conclusion that a lot of developers neither know much about it, nor care about it. It is almost as though it is considered of no consequence.</p>
<p>Hash codes are not really very complicated in principle, but far from being of no consequence, they are critical whenever you are designing a class that may be used as a key in a collection (like a hashtable or generic dictionary). Failure to implement this method correctly can cause havoc.</p>
<p><br/></p>
<h3><b>The tale of the bad hash code</b></h3>
<p>Some time ago, at a previous gig, we were developing a web application and we decided to that we would provide a crude cache for small amounts of data that were used repeatedly but never change (by <em>never</em> I mean that it changes so rarely that a recycle of the web app would be permissible; e.g. a list of U.S. counties). The aim here was to minimise the number of database calls for these small, frequently used and immutable sets of data. For reasons, lost in the mists of time, this cache was provided via a static <span class="code">Hashtable</span> (rather than, for instance, the System.Web.Caching.Cache).</p>
<p>We were building on some building blocks &#8220;donated&#8221; to us by another project, and one of these included a class called <span class="code">CacheKey</span>.</p>
<p>The idea was simple enough; each collection in the cache would have a name, (e.g. &#8220;USCounties&#8221;), and zero or more optional parameter fields (e.g. if you were caching U.S. counties in Texas <i>only</i> you might have name, &#8220;USCounties&#8221;, and a single field with the value &#8220;TX&#8221;).</p>
<p>The class looked something like this:</p>
<pre class="csharpcode"><span class="kwrd">public</span> <span class="kwrd">class</span> CacheKey
{
   <span class="kwrd">public</span> <span class="kwrd">string</span> Name;
   <span class="kwrd">public</span> <span class="kwrd">object</span>[] Fields;
}</pre>
<p>Whoever wrote this class had overridden the <span class="code">.Equals()</span> method in order to implement a notion of key equivalence; two keys are equal if their <span class="code">Name</span>s are equal, and their collection of <span class="code">Fields</span> also have equal contents (ignoring null checking on <font face="courier new">Fields</font> for brevity):</p>
<pre class="csharpcode"><span class="kwrd">public</span> <span class="kwrd">override</span> <span class="kwrd">bool</span> Equals(<span class="kwrd">object</span> obj)
{
    CacheKey supplied = obj <span class="kwrd">as</span> CacheKey;  

    <span class="kwrd">if</span> (supplied == <span class="kwrd">null</span>)
        <span class="kwrd">return</span> <span class="kwrd">false</span>;  

    <span class="kwrd">if</span> (<span class="kwrd">this</span>.Name != supplied.Name)
        <span class="kwrd">return</span> <span class="kwrd">false</span>;  

    <span class="kwrd">if</span> (<span class="kwrd">this</span>.Fields.Length != supplied.Fields.Length)
        <span class="kwrd">return</span> <span class="kwrd">false</span>;  

    <span class="kwrd">for</span> (<span class="kwrd">int</span> i = 0; i &lt; <span class="kwrd">this</span>.Fields.Length; i++)
    {
        <span class="kwrd">if</span> (<span class="kwrd">this</span>.Fields[i] != supplied.Fields[i])
        {
            <span class="kwrd">return</span> <span class="kwrd">false</span>;
        }
    }
    <span class="kwrd">return</span> <span class="kwrd">true</span>;
}</pre>
<p>However, they had overridden <span class="code">.GetHashCode()</span> with the following implementation:</p>
<pre class="csharpcode"><span class="kwrd">public</span> <span class="kwrd">override</span> <span class="kwrd">int</span> GetHashCode()
{
    <span class="kwrd">return</span> <span class="kwrd">base</span>.GetHashCode();
}</pre>
<p>It may be that no-one intentionally coded this method, but that they responded to the compiler warning &#8221; <span class="code">&#8216;MyProj.CacheKey&#8217; overrides Object.Equals(object o) but does not override Object.GetHashCode()</span> &#8221; by letting the IDE auto-generate the method (which gives exactly this implementation). </p>
<p>The problem is that, whether auto-generated or hand crafted, this is probably the worst implementation of <font face="courier new">GetHashCode</font> possible. (Not that I am blaming the IDE; I&#8217;m sure the motivation for this auto-complete is to allow your code to compile, not to provide a meaningful implementation).</p>
<p>&nbsp;</p>
<h3><b>Why is this implementation so bad?</b></h3>
<p>The golden rule of a hash code is that your GetHashCode implementation must match your Equals implementation. What does this mean? It means <u>exactly the same fields that contribute to the equality check must also contribute to the hash code</u>.</p>
<p>This is covered by the first two in the list of three requirements stated in the <a href="http://msdn2.microsoft.com/en-us/library/system.object.gethashcode.aspx">MSDN object.GetHashCode documentation</a>:</p>
<p><font face="Times New Roman">A hash function must have the following properties: </font></p>
<ul>
<li>
<p><font face="Times New Roman"><strong>If two objects compare as equal, the GetHashCode method for each object must return the same value. However, if two objects do not compare as equal, the GetHashCode methods for the two object do not have to return different values.</strong></font></p>
</li>
<li>
<p><font face="Times New Roman"><strong>The GetHashCode method for an object must consistently return the same hash code as long as there is no modification to the object state that determines the return value of the object&#8217;s </strong></font><a href="http://msdn2.microsoft.com/en-us/library/d5171fyh.aspx"><font face="Times New Roman"><strong>Equals</strong></font></a><font face="Times New Roman"><strong> method. Note that this is true only for the current execution of an application, and that a different hash code can be returned if the application is run again.</strong></font></p>
</li>
<li>
<p><font face="Times New Roman">For the best performance, a hash function must generate a random distribution for all input.</font></p>
</li>
</ul>
<p>There is no need to regurgitate the many descriptions of <a href="http://en.wikipedia.org/wiki/Hashtable">how</a> <a href="http://msdn2.microsoft.com/en-us/library/system.collections.hashtable.aspx">hashtables</a> <a href="http://www.relisoft.com/book/lang/pointer/8hash.html">actually</a> <a href="http://www.sparknotes.com/cs/searching/hashtables/section1.html">work</a>, but the very lightweight overview is this:</p>
<ol>
<li>Keys in the hash table are stored in individual buckets to reduce the overall search space when looking for an item in the hash table. </li>
<li>The result of <span class="code">.GetHashCode()</span> determines which bucket a key is stored in (i.e. when you Add an item to the hashtable, it calls <span class="code">.GetHashCode()</span> on the key and uses the resulting value to determine which bucket to place the key in).</li>
<li>When a &#8220;contains&#8221; check [e.g. <span class="code">hashtable.Contains(newKey)</span>] is performed, the key being checked (newKey) has its <span class="code">.GetHashCode()</span> method called. Based on this result, the &#8220;contains&#8221; method determines which bucket any possible matches may exist in, and performs an equality check on every item in this bucket until a match is found. </li>
</ol>
<p>Point three means it is critical that two &#8220;equal&#8221; keys return the same hash code, otherwise the whole basis on which a hash table works comes crashing down. If we get a <em>different </em>value every time we call <span class="code">.GetHashCode()</span> on different instances of <strong><em>equal</em></strong> keys, then our hashtable will forever be looking in the wrong buckets to see if we have a match; inevitably it will keep finding that we don&#8217;t have a match.</p>
<p>To put it another way, imagine trying to find your name in the phone book if you couldn&#8217;t rely on your brain&#8217;s ability to determine the first letter of your surname.</p>
<p>If every time I called <span class="code">brain.GetFirstLetterOfSurname()</span> I got a random letter (instead of &#8220;L&#8221; for Levine), then most of the time I&#8217;d be trying to find the name &#8220;Levine&#8221; in the section for another letter.</p>
<p>&nbsp;</p>
<h3><b>What is the effect of this broken implementation?</b></h3>
<p>So, back to the original issue; what is the effect of calling <span class="code">base.GetHashCode()</span>?  Our class inherits directly from <span class="code">System.Object</span>, so we are calling <span class="code">System.Object</span>&#8217;s implementation of <span class="code">.GetHashCode()</span>. I don&#8217;t know the detail of it&#8217;s implementation, but calling it 5000 times, each for a new instance of a <span class="code">CacheKey</span> with a name of &#8220;USCounties&#8221; produced this:</p>
<p><img src="http://blog.roblevine.co.uk/images/BadGetHashCode.png" alt="Hashcode graph for broken implementation"/> </p>
<p>which is <em>bad</em>; a different value for each call. What we <em>wanted</em> to see was the same hash code being returned for every instance (since they are all equal). In other words we wanted this (a straight line):</p>
<p><img src="http://blog.roblevine.co.uk/images/GoodGetHashCode.png" alt="Hashcode graph for working implementation"/> </p>
<p>What is the net effect of this in our cache? When I first do a call to the cache to get my list of counties, the cache informs me the data isn&#8217;t already cached (i.e. a cache miss) and so a call to the database is made, the data is retrieved, and this new collection is added to the cache. This is correct behaviour. </p>
<p>But on the second call to get my list of counties, the cache <em>still</em> informs me the data isn&#8217;t already cached and so we go to the database <em>again</em>, retrieve the data and add it to the cache. And again on the third call, and so on.</p>
<p>This gives us two issues. Firstly my cache is broken; I am making a database call every time I request the data, which defeats the purpose of the cache. But <strong><em>much much</em></strong> worse than that &#8211; every time I <em>am</em> actually caching the data and then losing it in the cache. So after 5000 requests for the data, I have it cached 5000 times*:</p>
<p><strong>I have one great big memory leak</strong>.</p>
<p>The moral of the story here is don&#8217;t implement your own cache-key type classes unless you actually implement <span class="code">.Equals()</span> and <span class="code">.GetHashCode()</span> together in a compatible way. In fact, ninety-nine times out of a hundred you probably won&#8217;t need to do this; using a string or integer as a cache-key has no such issues &#8211; it only becomes an issue when you implement your own.</p>
<p>Implementing a simple, but reasonable hashing algorithm is fairly simple, and I&#8217;ll look at this in my next post.</p>
<p>* Due to the statistical likelihood of the occasional &#8220;random&#8221; hash code collision I may have the odd cache hit, so maybe I am only wasting 4998 instances instead of 4999!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.roblevine.co.uk/?feed=rss2&amp;p=13</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Making the .Net XmlTextReader accept colons in element names.</title>
		<link>http://blog.roblevine.co.uk/?p=12</link>
		<comments>http://blog.roblevine.co.uk/?p=12#comments</comments>
		<pubDate>Thu, 06 Mar 2008 16:19:52 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[.Net]]></category>
		<category><![CDATA[XML]]></category>

		<guid isPermaLink="false">http://blog.roblevine.co.uk/?p=12</guid>
		<description><![CDATA[This started as an addendum, to What is a valid XML element name?, but then I discovered something that made it worth breaking out into a separate post!
Ayende added a comment to his blog (under my comment) to say that he tried the &#8216;bad&#8217; xml in question on three parsers and none of them could [...]]]></description>
			<content:encoded><![CDATA[<p>This started as an addendum, to <a href="http://blog.roblevine.co.uk/?p=11">What is a valid XML element name?</a>, but then I discovered something that made it worth breaking out into a separate post!</p>
<p>Ayende added a comment to his blog (under my comment) to say that he tried the &#8216;bad&#8217; xml in question on three parsers and none of them could handle it. Naturally I thought I&#8217;d have a quick go too.</p>
<p>First I tried with the .Net <span class="code">System.Xml.XmlDocument</span> class and the <span class="code">System.Xml.XmlTextReader</span> class and neither of these would handle the &#8220;double-colon&#8221; element names. Next I tried two commercial XML editors, XmlSpy and StylusStudio, both of which were happy to let it pass their well-formed check without complaint (and they <em>do</em> both start complaining if you add other non-allowed characters). I don&#8217;t know what parsers either of these products are built on, but on the surface they seemed to be more compliant than .Net</p>
<p>Or so it appeared. One thing I noticed was that both <span class="code">System.Xml.XmlDocument</span> class and <span class="code">System.Xml.XmlTextReader</span> classes barf with the same exception, being raised from within <span class="code">System.Xml.XmlTextReaderImpl.ParseElement()</span>.</p>
<p>A quick look at this method using Lutz Roeder&#8217;s excellent <a href="http://www.aisto.com/roeder/dotnet/">Reflector</a> revealed something new and interesting. This class (<span class="code">System.Xml.XmlTextReaderImpl</span>) has an internal boolean property, <span class="code">Namespaces</span>, which changes the behaviour of this element parsing method to allow or disallow multiple colons. This makes sense when you think about it; if you don&#8217;t support namespaces then there is no issue with multiple colons. If you do support namespaces then the colon is reserved to separate the namespace prefix from the element&#8217;s local name. It is this very point that the XML RFC refers to regarding colons, and which I quoted in the previous article.</p>
<p>A closer look still revealed that a <span class="code">Namespaces</span> property is exposed on the <span class="code">System.Xml.XmlTextReader</span> class. And guess what? Setting this property to false allows the reader to start accepting &#8220;multi-colon&#8221; element names! Well &#8211; that is certainly a new one to me.</p>
<p>However, I couldn&#8217;t find an equivalent way of changing the <span class="code">System.Xml.XmlDocument</span>&#8217;s behaviour to accept this type of xml. To be honest, I&#8217;m not too bothered, because I can&#8217;t imagine using this particular style of xml any time soon!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.roblevine.co.uk/?feed=rss2&amp;p=12</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What is a valid XML element name?</title>
		<link>http://blog.roblevine.co.uk/?p=11</link>
		<comments>http://blog.roblevine.co.uk/?p=11#comments</comments>
		<pubDate>Wed, 05 Mar 2008 17:43:26 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[.Net]]></category>
		<category><![CDATA[XML]]></category>

		<guid isPermaLink="false">http://blog.roblevine.co.uk/?p=11</guid>
		<description><![CDATA[Ayende Rahien has noted some peculiar looking XML that is output by Subversion.
Specifically, he takes issue with a start tag of
&#60;C:bugtraq:label&#62;
Well &#8211; I can honestly say that I&#8217;ve never seen XML like this before (and I&#8217;ve been using XML since the late Cretaceous Period) but I&#8217;m not so sure that it is actually wrong! Although [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://ayende.com/Blog/archive/2008/03/04/Subversion-is-Xmlish.aspx">Ayende Rahien</a> has noted some peculiar looking XML that is output by Subversion.</p>
<p>Specifically, he takes issue with a start tag of</p>
<p><span class="codeblock">&lt;C:bugtraq:label&gt;</span></p>
<p>Well &#8211; I can honestly say that I&#8217;ve never seen XML like this before (and I&#8217;ve been using XML since the late <a href="http://en.wikipedia.org/wiki/Cretaceous_Period">Cretaceous Period</a>) but I&#8217;m not so sure that it is actually <em>wrong</em>! Although I too balk at the sight of it, the <a href="http://www.w3.org/TR/REC-xml/">XML RFC</a> does seem to permit this as valid and parsable XML.<br />
Taking a look at section on <a href="http://www.w3.org/TR/REC-xml/#sec-starttags">start tags</a> would appear to allow for this:</p>
<p>A start tag is defined by</p>
<p><span class="codeblock">STag ::= &#8216;&lt;&#8217; Name (S Attribute)* S? &#8216;&gt;&#8217;</span></p>
<p>where <span class="code">Name</span> is defined as</p>
<p><span class="codeblock">Name ::= (Letter | &#8216;_&#8217; | &#8216;:&#8217;) (NameChar)*</span></p>
<p>and <span class="code">NameChar</span> is defined as</p>
<p><span class="codeblock">NameChar ::= Letter | Digit | &#8216;.&#8217; | &#8216;-&#8217; | &#8216;_&#8217; | &#8216;:&#8217; | CombiningChar | Extender</span></p>
<p>What this says to me is that this is an allowable start tag. In fact, from a syntactical standpoint, there seems to be no limit to the use of colons in element names.</p>
<p>However, the RFC does also have this to say about the use of colons:</p>
<p>&#8220;The Namespaces in XML Recommendation <a href="http://www.w3.org/TR/REC-xml/#xml-names">[XML Names]</a> assigns a meaning to names containing colon characters. Therefore, authors should not use the colon in XML names except for namespace purposes, but XML processors must accept the colon as a name character.&#8221;</p>
<p>In other words &#8211; other parts of related XML specifications reserve the colon, but from a purely XML markup standpoint this would appear to be valid.</p>
<p>It may be a WTF, and it certainly isn&#8217;t nice but I think it is actually <em>valid</em> and is not &#8220;<em>wrong, period</em>&#8221; (as claimed), just &#8220;<em>wrong, it makes me feel weird</em>&#8220;!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.roblevine.co.uk/?feed=rss2&amp;p=11</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
