a. Generating the encrypted signatures of passwords so that the actual passwords do not need to be stored on systems which authenticate these.
b. Storing sets of file signatures off-line or in write-once storage so that suspicious file and system modifications can be detected by periodically comparing expected and actual signatures.
c. Generating the keys and digital signatures used in e-commerce and for encrypting private messages and sensitive data.
If the data keys meet certain criteria this technique also enables the data to be stored in sorted order, such that creation of the hash table is equivalent to a "pigeon hole" sort, an algorithm named after the way mail used to be sorted by hand in postal sorting offices. This algorithm gives a number of comparisons and record moves both to the order of N, i.e. approximately 1 comparison or move is needed per record to find or store the data in sorted order. This is more efficient than any other known sorting algorithm, with the best alternatives such as quick sort and merge sort giving numbers of moves and comparisons both to the order of Nlog2N where there are N data items.
We use such a hashing technique implicitly when deciding where to open a dictionary in order most quickly to find a word (the "key") and definition (the rest of the data record associated with the key or the "value"). For example if searching for the word "corrugated" we are likely quickly to estimate from the fact that the word is about 2/3rds through the words starting with the third of of the 26 letters of the alphabet that "corrugated" is likely to be approximately 1/10th of the way through the dictionary. We would therefore probably start looking for this word by opening the dictionary 1/10th of the way through . Our second guess would almost certainly also be biassed based on how close the word revealed by our first guess was to our desired key to give a better approximation to the dictionary location as compared to an exact division by 2 of the range.
This approach results in our search being quicker than the conventional binary split search which would have taken to the order of log2N comparisons when searching for one word within a range of N words.
For any likely practical applications of this technique collisions should be considered inevitable. The exception would be where there is a unique one to one mapping between keys and table locations. E.G if the data has unique integer keys within the range -50 and +50 then the hash function y=h(x) storage would be y=x+50; when using a 'C' style array . Three approaches are used for handling these collisions:
The following function would return the index used for a particular key:
int search(char *key, char *table[], int table_size){
/* finds key in open hash table, returns -1 if not found. Unused table entries
are set to NULL. */
int index,start;
index=hashfunc(key); /* start of search based on hash function */
if (strcmp(table[index],key) == 0) /* key found in expected position */
return index;
start=index;
while(1){
index++;
index=index%table_size; /* avoids overflow, continues at start of table */
if (strcmp(table[index],key)==0) /* key found */
return index;
if (table[index]==NULL) /* null (unused) position found so key not present */
return -1; /* signals key not present */
if (index==start){ /* error condition indicating all table locations searched and
used for other data ! */
fprintf(stderr,"hash table full error: allocate more space !\n");
return -1;
}
}
}
A variant of this function might be used to determine where to store a
new key.
In practice, the most popular implementations of hash tables involve chained hashing. So that a search for a given key can indicate that data for that key is not present a NULL pointer will indicate an unused position (or the last of one or more records in a linked list) in a chained hash table. For open and quadradic hashing, if the hash table is used to store data directly as opposed to data references or pointers, a sentinel value will need to be selected to indicate an unused position which can never occur within the stored data itself and as noted above, a secondary sentinel will be required to indicate positions for deleted values which would be traversed when searching for an key not found at a location occupied by another record.
y1=625*(tolower(key[0]) - 'a') + 25*(tolower(key[1]) - 'a') + (tolower(key[2]) - 'a');This function would give the lowest key "aaa" a hash of 0 and the highest key "zzz" a hash of 16275. Suppose our table size were 997 we could then map this range (0-16275) to an array index between 0 and 996 using, in 'C' :
y=(int)(y1*995.999/16275; /* note slight rounding down of range and use of float arithmetic to avoid rounding and overflow bugs */The complete hash function might then look like this:
int hashfunc(char *key, int array_size){
/* calculates linear position of word: key within a
case-insensitive, sorted chained hash array.
This function is designed for keys of 3 or more alphas,
but should be safe otherwise*/
unsigned long int y=0; int i;
for(i=0;i<3 && key[i] != '\0'; i++)
y=25y + (tolower(key[i]) - 'a');
y=(long int)(y*(array_size - 1.001)/16275; /* note slight rounding
down and cast to float to avoid rounding and overflow
arithmetic bugs*/
if(y>=array_size) /* if only alphabetic (a-z), (A-Z) data this
won't happen but avoids overflow if not */
y=y%array_size; /* always gets 0 <= y <= array-size - 1 */
return y; /* location to be used within table */
}
Where the key distribution pattern is unknown this division - remainder method is thought to give best performance and is commonly used (1).
The following hash function adopts this approach:
int hashfunc(char *key, int array_size){
/* scatters position of key within a rapid search hash array intended to
reduce collision probability given unknown key distribution. */
unsigned long int y=0; int i;
for(i=0;i<5 && key[i] != '\0'; i++)
y=29*y + (tolower(key[i]) - 'a');
return y%array_size;
}
The multiplier 29 is used in preference to 25 as the first prime number
greater than 25.
An analysis of relative search performance (2) against table occupancy
ratios shows how performance worsens as the table fills:
|
Occupancy |
Average probes to find key |
|
|
0=empty 1=full |
quadratic hash |
linear hash |
|
0.1 |
1.05 |
1.06 |
|
0.5 |
1.39 |
1.50 |
|
0.75 |
1.85 |
2.50 |
|
0.9 |
2.56 |
5.50 |
|
0.95 |
3.15 |
10.50 |
What approach should be taken if , despite reasonable allowances for growth, records are added beyond an optimal size for the table ? Practical applications are most likely to include functions which read data from an external file and store it into a hash table, and which enable data in an updated table to be written back to the file. A solution would therefore involve dynamically allocating the space needed to store the hash table based on the number of records in the file and reusing these read/write functions whenever an unacceptable ratio is detected between the size of the table and the number of records stored in it.
In this event the table read and write functions would be reused to reorganise the hash table. For this reason the hash functions described above take the table size into account and will scale to handle changes in this value.
(2) Barron, D.W. & Bishop J.M. "Advanced Programming: A Practical
Course" John Wiley & Sons
Filesystem Security - store signatures off line and monitor changes
Digital Signatures - proves signatory owns a particular public key
Data Encryption - creating symmetric and session keys
Data Processing - Indexing data stored in hash tables
- Often one way - expensive or impossible to reproduce or simulate an object obj such that y=f(obj) generates a desired value of y.
- Sorting requires a linear scaling from key range to data storage location.
- If intended to speed data access the function must be computed quickly.
- If intended to spread data evenly in hash table function must have
good scattering properties.
- Can be used to store data in sorted order such that an order of 1 comparisons and moves per item sort (pigeonhole sort) is possible.
- Collisions almost inevitable except in rare cases, need to be handled
- Best performance requires a larger table than number of data items to be stored - some memory inefficiency and unused positions.
- Simple approach but leads to more data clustering
- Avoids clustering and quick to calculate steps.
- Table size should be a prime or a 4k + 3 prime to ensure access to all table locations
- more complex to implement than open or quadratic hashing but less likely to result in performance degradation
- works even if there are more data objects than table positions
Row word 1 word 2 word 3 word 4 0 abacus 1 banana bean 2 bubble 10 green 11 hash hen 12 hinge 16 key 17 knee 18 list 25 print 27 queue 28 return 30 sort stack 31 strangle stretch string table 32 turn 33 uncertain 35 very 37 window 39 xray 40 yellow 41 zany 42 zygoteOccupancy of rows: 22/43 = 0.51
Data items per row: 28/43 = 0.65
Average number of comparisons to find or sort data item: 33/28 = 1.18