Hash Codes
Hash Codes
Whenever equality is redefined for a type, the hash code computation for that type needs to be redefined in a consistent way. This is done by overriding that type’s GetHashCode method. In order for hashing to be implemented correctly and efficiently, this method should satisfy the following goals:
- Equal keys must have the same hash code. This is necessary in order for the Dictionary<TKey, TValue> class to be able to find a given key that it has stored. On the other hand, because the number of possible keys is usually larger than the number of possible hash codes, unequal keys are also allowed to have the same hash code.
- The computation should be done quickly.
- Hash codes should be uniformly distributed even if the keys are not.
The last goal above may seem rather daunting, particularly in light of
our desire for a quick computation. In fact, it is impossible to
guarantee in general — provided there are more than
A guiding principle in developing hashing techniques is to use all information available in the key. By using all the information, we will be sure to use the information that distinguishes this key from other keys. This information includes not just the values of the individual parts of the the key, but also the order in which they occur, provided this order is significant in distinguishing unequal keys. For example, the strings, “being” and “begin” contain the same letters, but are different because the letters occur in a different order.
One specific technique initializes the hash code to 0, then processes the key one component at a time. These components may be bytes, characters, or other parts no larger than 32 bits each. For example, for the Nim board positions discussed in “Memoization”, the components would be the number of stones on each pile, the limit for each pile, and the total number of piles (to distinguish between a board ending with empty piles and a board with fewer piles). For each component, it does the following:
- Multiply the hash code by some fixed odd integer.
- Add the current component to the hash code.
Due to the repeated multiplications, the above computation will often overflow an int. This is not a problem — the remaining bits are sufficient for the hash code.
In order to understand this computation a little better, let’s first
ignore the effect of this overflow. We’ll denote the fixed odd integer
by
Because the above is a polynomial, this hashing scheme is called
polynomial hashing. While the computation itself is efficient,
performing just a couple of arithmetic operations on each component, the
result is to multiply each component by a unique value (
Now let’s consider the effect of overflow on the above polynomial. What
this does is to keep only the low-order 32 bits of the value of the
polynomial. Looking at it another way, we end up multiplying
There are other potential problems with using certain odd numbers for
It turns out that this cyclic behavior occurs no matter what odd number
we use for
The other potential problem occurs when we are hashing fairly short
keys. In such cases, if
Putting all this together, choosing
We should always save the hash code in a private field after we compute it so that subsequent requests for the same hash code don’t result in repeating the computation. This can be done in either of two ways. One way is to compute it in an eager fashion by doing it in the constructor. When doing it this way, the GetHashCode method simply needs to return the value of the private field. While this is often the easiest way, it sometimes results in computing a hash code that we end up not using. The alternative is to compute it in a lazy fashion. This requires an extra private field of type bool. This field is used to indicate whether the hash code has been computed yet or not. With this approach, the GetHashCode method first checks this field to see if the hash code has been computed. If not, it computes the hash code and saves it in its field. In either case, it then returns the value of the hash code field.