HASH FUNCTIONS AND TABLES

Richard Kay, 12 May 2000

1. Introduction

Hashing functions, tables and algorithms have many applications in computing. These include applications in security conscious computing and techniques enabling more efficient sorting and searching strategies. Much research and investigation has been carried out into this area , the results of which includes freely available programming libraries with full source code which efficiently implements many of the applications described below in a "black box" manner such that a programmer making use of these techniques can build upon work already carried out by other programmers without needing to know the full implementation details. The Perl and Java languages for example provide access to hashes as integral language and standard library features.

2. Hash functions

A hash function generates a signature from a data object, with the signature typically being shorter than the object which the signature represents. Hash functions are generally one-way, meaning that it is easy to generate the signature from the object but difficult and expensive or impossible to generate the object from the signature. This technique has some important applications in computing security, including:

a. Generating the encrypted signatures of passwords so that the actual passwords do not need to be stored on systems which authenticate these.

b. Storing sets of file signatures off-line or in write-once storage so that suspicious file and system modifications can be detected by periodically comparing expected and actual signatures.

c. Generating the keys and digital signatures used in e-commerce and for encrypting private messages and sensitive data.

3. Use of hash functions in data storage

The application of hash functions covered in these notes is for an efficient method of data storage and access known as the hash table. The hash function is used for locating data within a hash table based on the signatures computed from record keys. Storing data with the location based on the content enables the most rapid possible searching for data based on the key. This also requires that the hash function is computed quickly.

If the data keys meet certain criteria this technique also enables the data to be stored in sorted order, such that creation of the hash table is equivalent to a "pigeon hole" sort, an algorithm named after the way mail used to be sorted by hand in postal sorting offices. This algorithm gives a number of comparisons and record moves both to the order of N, i.e. approximately 1 comparison or move is needed per record to find or store the data in sorted order. This is more efficient than any other known sorting algorithm, with the best alternatives such as quick sort and merge sort giving numbers of moves and comparisons both to the order of Nlog2N where there are N data items.

We use such a hashing technique implicitly when deciding where to open a dictionary in order most quickly to find a word (the "key") and definition (the rest of the data record associated with the key or the "value"). For example if searching for the word "corrugated" we are likely quickly to estimate from the fact that the word is about 2/3rds through the words starting with the third of of the 26 letters of the alphabet that "corrugated" is likely to be approximately 1/10th of the way through the dictionary. We would therefore probably start looking for this word by opening the dictionary 1/10th of the way through . Our second guess would almost certainly also be biassed based on how close the word revealed by our first guess was to our desired key to give a better approximation to the dictionary location as compared to an exact division by 2 of the range.

This approach results in our search being quicker than the conventional binary split search which would have taken to the order of log2N comparisons when searching for one word within a range of N words.

4. Handling collisions

A practical difficulty with this technique arises from the fact that the hash function used to translate between the key values and data locations within the hash table is likely to give the same locations for different keys, particularly if the range of possible keys is greater than the number of data records. E.G if we want to store dictionary entries for the keys: "perturbed" and "persuaded" a hash function designed to store these entries in sorted order within a table of a practical size for locating up to say 1000 entries might well resolve these keys to the same location within the table.

For any likely practical applications of this technique collisions should be considered inevitable. The exception would be where there is a unique one to one mapping between keys and table locations. E.G if the data has unique integer keys within the range -50 and +50 then the hash function y=h(x) storage would be y=x+50; when using a 'C' style array . Three approaches are used for handling these collisions:

a. Open hashing

If a location for a key is already occupied by another record go to the next unused table location and store the record there. The next location from the top is the first location (array_element[0] ) . If the remainder is taken when dividing the position number by table size; i.e. array_location = position_number % array_size; this modulus operation will always map any positive integral position_number to a valid array_location . For the next-unused location algorithm to work it follows that there must be more locations in the hash table than stored records. Also if deletion of data is required there must also be some means of flagging data in a location having been deleted as seperate from a position as being not previously used, otherwise records which may have been located after a deletion point will no longer be efficiently accessible.

The following function would return the index used for a particular key:

int search(char *key, char *table[], int table_size){
    /* finds key in open hash table, returns -1 if not found. Unused table entries
        are set to NULL. */
    int index,start;
    index=hashfunc(key); /* start of search based on hash function */
    if (strcmp(table[index],key) == 0) /* key found in expected position */
       return index;
     start=index;
     while(1){
        index++;
        index=index%table_size; /* avoids overflow, continues at start of table */
        if (strcmp(table[index],key)==0) /* key found */
            return index; 
        if (table[index]==NULL) /* null (unused) position found so key not present */
            return -1; /* signals key not present */
        if (index==start){ /* error condition indicating all table locations searched and                  
                                        used for other data ! */ 
          fprintf(stderr,"hash table full error: allocate more space !\n");
          return -1;
        }
     } 
}
A variant of this function might be used to determine where to store a new key.

b. Quadratic hashing

If a location for a key is already occupied by another record find the next unused location by trying locations separated from the calculated location by 1,4,9,15,25,49... positions (i.e the series of perfect squares) on from the original record position (using the modulus operation described for open hashing). The advantage of this approach is that data is less likely to become clustered (and therefore requiring more access operations) than would occur with open hashing. Calculating the successive squares can also be reduced to quicker addition by virtue of the fact that the series of quadratic locations 0,1,4,9,16,25... from the origin are separated by the series of jumps 1,3,5,7,9... from each other. This approach is more likely to find an empty position if the hash table size is a prime number; if not there is a greater risk of jumps skipping over unused positions and revisiting previously searched ones.

c. Chained hashing

This involves co-location of 0 or more data items using a singly linked list starting at the array location returned by the hash function. If the array size and hash function are chosen in order to reduce the frequency of collisions such that say, 90% of records are the only record at their array location, then it is probable that a further 9% will be chained in a list length of 2, and 0.9% will be triply located, 0.09% will by quadruply located etc. This would result in an average number of comparisons needed to find a single data item of approximately (0.9n + 0.09n*1.5 + 0.009n*2 + 0.0009n*2.5...)/n which is 1.0555555, or close enough to 1.0 to make little difference.

In practice, the most popular implementations of hash tables involve chained hashing. So that a search for a given key can indicate that data for that key is not present a NULL pointer will indicate an unused position (or the last of one or more records in a linked list) in a chained hash table. For open and quadradic hashing, if the hash table is used to store data directly as opposed to data references or pointers, a sentinel value will need to be selected to indicate an unused position which can never occur within the stored data itself and as noted above, a secondary sentinel will be required to indicate positions for deleted values which would be traversed when searching for an key not found at a location occupied by another record.

5. A hash function selected for sorting

 

If the application is intended to enable most efficient output of data in sorted order the hash function will need to match the range of possible keys, probably in a linear manner, to the range of available hash table locations. Supposing the hash function y=f1(x) were to take the first three letters from the alphabetic key, convert these to lower case and calculate positions 0 for a, 1 for b, 2 for c etc. up to 25 for z. The value of the first letter could be multiplied by 625, added to the value of the second letter multiplied by 25 and added to the value of the third letter. In 'C':
y1=625*(tolower(key[0]) - 'a') + 25*(tolower(key[1]) - 'a') + 
  (tolower(key[2]) - 'a');
This function would give the lowest key "aaa" a hash of 0 and the highest key "zzz" a hash of 16275. Suppose our table size were 997 we could then map this range (0-16275) to an array index between 0 and 996 using, in 'C' :
y=(int)(y1*995.999/16275; /* note slight rounding down of 
     range and use of float arithmetic to avoid rounding and 
     overflow bugs */
The complete hash function might then look like this:
int hashfunc(char *key, int array_size){
    /* calculates linear position of word: key within a 
      case-insensitive, sorted chained hash array.
      This function is designed for keys of 3 or more alphas, 
      but should be safe otherwise*/
    unsigned long int  y=0; int i;
    for(i=0;i<3 && key[i] != '\0'; i++) 
        y=25y + (tolower(key[i]) - 'a'); 
    y=(long int)(y*(array_size - 1.001)/16275; /* note slight rounding 
            down and cast to float to avoid  rounding  and overflow 
            arithmetic bugs*/
     if(y>=array_size) /* if only alphabetic (a-z), (A-Z) data this 
                          won't happen but avoids overflow if not */
       y=y%array_size; /* always gets 0 <= y <= array-size - 1 */
     return y; /* location to be used within table */
}

6. A hash function optimised for searching

If the application is intended to enable the fastest possible random searching and access of data, the hash function is designed to reduce the number of collisions which otherwise result in longer searches for clustered data. Where this approach is taken the hash function y=f(x) tends to attempt widely to scatter the values of y for used values of x. This will typically involve generation of an integer y1 computed from the key x which is typically large compared to the table size and which is then converted to a positive integer y within the storage location range using the modulus (remaindering) operation: y=y1%table_size;

Where the key distribution pattern is unknown this division - remainder method is thought to give best performance and is commonly used (1).

The following hash function adopts this approach:

int hashfunc(char *key, int array_size){
    /* scatters position of key within a rapid search hash array intended to 
        reduce collision probability given unknown key distribution. */
    unsigned long int  y=0; int i;
    for(i=0;i<5 && key[i] != '\0'; i++) 
        y=29*y + (tolower(key[i]) - 'a'); 
     return y%array_size;
}
The multiplier 29 is used in preference to 25 as the first prime number greater than 25.

7. Sizing hash tables

Open and quadratic (direct storage) methods which can only store 1 record per hash table location clearly need more array locations than records, while the performance of chained hashes will deteriorate more gradually as the occupancy ratio increases beyond 1 record per array location, in the absolute worst case to that of the chained structure (e.g. single linked list) indexed at a single "array" location. Collision and clustering problems are more likely to occur if the number of records is close to the table size. A good rule of thumb based on practical experience suggests that a table with n keys should have a table size of at least 3n/2. Efficient use of quadratic hash tables suggests this minimum size should be increased to the next prime number of the form 4k+3 where k is an integer, as this guarantees that every slot will be visited (2). Primes which meet this requirement include 11,19,23,31,43,47,59,67,79 (e.g. 11 = 4*2 +3 ) and very many others. This consideration also applies to non-sorting applications where hash functions using modulus operations involving prime numbers are more likely to scatter keys over the positions available.

An analysis of relative search performance (2) against table occupancy ratios shows how performance worsens as the table fills:
 

Occupancy

Average probes to find key

0=empty

1=full

quadratic

hash

linear

hash

0.1

1.05

1.06

0.5

1.39

1.50

0.75

1.85

2.50

0.9

2.56

5.50

0.95

3.15

10.50 

What approach should be taken if , despite reasonable allowances for growth, records are added beyond an optimal size for the table ? Practical applications are most likely to include functions which read data from an external file and store it into a hash table, and which enable data in an updated table to be written back to the file. A solution would therefore involve dynamically allocating the space needed to store the hash table based on the number of records in the file and reusing these read/write functions whenever an unacceptable ratio is detected between the size of the table and the number of records stored in it.

In this event the table read and write functions would be reused to reorganise the hash table. For this reason the hash functions described above take the table size into account and will scale to handle changes in this value.

Further reading:

(1) Loomis, Mary E.S. "Data Management and File Structures" Second Edition Prentice Hall International Editions

(2) Barron, D.W. & Bishop J.M. "Advanced Programming: A Practical Course" John Wiley & Sons
 
 
 
 
 
 
 
 

HASH FUNCTION AND TABLE OHPs

1. Definition of hash function:

A hash function generates a signature from a data object

2. Applications of hash functions:

Password Security - avoid having to store password anywhere

Filesystem Security - store signatures off line and monitor changes

Digital Signatures - proves signatory owns a particular public key

Data Encryption - creating symmetric and session keys

Data Processing - Indexing data stored in hash tables

3. Properties of hash functions

- Various desirable properties, function is designed to suit application

- Often one way - expensive or impossible to reproduce or simulate an object obj such that y=f(obj) generates a desired value of y.

- Sorting requires a linear scaling from key range to data storage location.

- If intended to speed data access the function must be computed quickly.

- If intended to spread data evenly in hash table function must have good scattering properties.
 
 
 
 
 

Hash Tables as a Data Structure

1. Advantages:

- Reduces search time to the order of 1 comparison per data item (in practice speed dependent on time to compute hash function).

- Can be used to store data in sorted order such that an order of 1 comparisons and moves per item sort (pigeonhole sort) is possible.

2. Disadvantages:

- Need dynamically to size table for efficient storage, more complex than some other methods.

- Collisions almost inevitable except in rare cases, need to be handled

- Best performance requires a larger table than number of data items to be stored - some memory inefficiency and unused positions.

Collision Handling Approaches

1. Open Hashing

- If position already in use go sequentially to next unused position and store data there. If at top of table, start again at bottom.

- Simple approach but leads to more data clustering

2. Quadratic hashing

- If position already in use go using a series of steps from this position such that each successive step is in the series of squares 1,4,9,16,25 ... etc, until the first unused position and store data there. Use modulus arithmetic to find position if positions higher than the top of table would be accessed as with open hashing.

- Avoids clustering and quick to calculate steps.

- Table size should be a prime or a 4k + 3 prime to ensure access to all table locations

3. Chained Hashing

- each table location either NULL or a Linked List

- more complex to implement than open or quadratic hashing but less likely to result in performance degradation

- works even if there are more data objects than table positions

Empty Rows and Collisions in A Chained Hash Designed For Sorting

Unused rows e.g. 3-9, 13-15, 19 etc. within the range 0 - 42 are not listed
Row     word 1          word 2          word 3          word 4
  0     abacus          
  1     banana          bean            
  2     bubble          
 10     green           
 11     hash            hen             
 12     hinge           
 16     key             
 17     knee            
 18     list            
 25     print           
 27     queue           
 28     return          
 30     sort            stack           
 31     strangle        stretch         string          table           
 32     turn            
 33     uncertain       
 35     very            
 37     window          
 39     xray            
 40     yellow          
 41     zany            
 42     zygote
Occupancy of rows: 22/43 = 0.51

Data items per row: 28/43 = 0.65

Average number of comparisons to find or sort data item: 33/28 = 1.18