Data Structure Study Notes <9> Hash Search

1. Hash Table#

1. Introduction to Concepts#

Management of variables in compilation: Dynamic Search Problem

Using search trees? — Comparing two variable names (strings) is not efficient
Convert strings to numbers, then process? — This is the idea of hash search.
Known search methods:
Sequential Search O(N)
Binary Search (Static Search) O(log~2~N)
Binary Search Tree O(h) where h is the height of the binary search tree
Balanced Binary Tree O(log~2~N)

Insert image description here

Question: How to quickly search for the required keywords? What if the keywords are not easy to compare?#

The essence of searching: Finding the position of known objects

Ordered arrangement of objects: total order (completely ordered, e.g., binary search), partial order (some keywords have an order, e.g., search tree)
Directly "calculate" the position of the object: Hashing

Basic Work of Hashing#

Calculate position: Construct a hash function to determine the storage position of keywords.
Resolve conflicts: Apply some strategy to resolve the issue of multiple keywords having the same position.

Time Complexity#

The time complexity is almost constant: O(1), meaning the search time is independent of the problem size!

2. Hash Search#

(1) Basic Idea#

Use the keyword key as the independent variable, and through a determined function h (hash function), calculate the corresponding function value h(key) as the storage address of the data object.
Different keywords may map to the same hash address, i.e., h(key~i~) = h(key~j~) (when key~i~ ≠ key~j~), which is called "Collision." — Requires some conflict resolution strategy.

In the case of no conflict, searching only requires calculating the address using the hash function.
Insert image description here

Load Factor#

Load Factor (Loading Factor): Let the size of the hash table be m, and the number of elements filled in the table be n, then α = n / m is called the load factor of the hash table.

As shown in the figure, α = 11 / 17 ≈ 0.65
If there is no overflow, then T~query~ = T~insert~ = T~delete~ = O(1)

Rehashing#

When the number of elements in the hash table is too large (i.e., the load factor α is too high), the search efficiency will decrease.
- The practical maximum load factor is generally taken as 0.5 ≤ α ≤ 0.85
The solution is to double the size of the hash table, a process called "Rehashing."

(2) Construction of Hash Functions#

Two Factors#

A "good" hash function generally considers the following two factors:

Simple calculation to improve conversion speed.
Uniform distribution of address space corresponding to keywords to minimize collisions.

Hash Function Construction for Numeric Keywords#

① Direct Addressing Method#

Take a certain linear function value of the keyword as the hash address, i.e., h(key) = a × key + b (a, b are constants).
Insert image description here

② Division Remainder Method (Commonly Used)#

The hash function is: h(key) = key mod p
e.g., h(key) = key %17
Insert image description here
Generally, to ensure that the address space corresponding to keywords is distributed uniformly, p is taken as a prime number.

③ Digit Analysis Method#

By analyzing the variation of numeric keywords at each digit, take comparatively random digits as the hash address.

Insert image description here

④ Folding Method#

Split the keyword into several parts of the same digit, then add them together.
Insert image description here

⑤ Square Middle Method#

Split the keyword into several parts of the same digit, then add them together

Hash Function Construction for Character Keywords#

Simple Hash Function — ASCII Code Addition#

Define the hash function for character-type keywords key as follows: h(key) = (Σkey[i]) mod TableSize
Severe collisions!!

Simple Improvement — First 3 Characters Shift Method#

h(key) = (key[0] × 27^2^ + key[1] ×27 + key[2]) mod TableSize
Still collisions, space waste.

Good Hash Function — Shift Method#

Design all n characters of the keyword, and distribute them well.
h(key) = ($\begin{matrix} \sum_{i=0}^{n-1}key[n-i-1] \times32^i \end{matrix}$) mod TableSize
The code is as follows:

typedef string DataType; // Data type
typedef int Index; // Hash index
// Returns the index after calculation by the hash function
Index Hash(string Key, int TableSize) { 
    unsigned int h = 0; // Hash function value, initialized to 0
    int len = Key.length();
    for(int i = 0; i < len; ++i) 
        h = (h << 5) + Key[i];
    
    return h % TableSize;
}

(3) Conflict Handling in Hash Search#

Common strategies for handling conflicts are as follows:

Change position — Open Addressing Method
Organize conflicting objects at the same position together — Chain Addressing Method

Open Addressing Method#

Open Addressing Method, once a collision occurs (the position already has other elements), it searches for another empty address according to some rules. Its advantage is that the hash table is an array, storage efficiency is high, random search; the disadvantage is that the hash table has a "clustering" phenomenon. In the open addressing hash method, the delete operation needs to be very careful, it can only be "lazily deleted," meaning a deletion marker needs to be added instead of a real deletion, so that it won't "break the chain" during searching. Its space can be reused during the next insertion.

If a i-th collision occurs, the next address to probe will increase d~i~, the basic formula is: h~i~(key) = (h(key) + d~i~) mode TableSize(1 ≤ i < TableSize)
d~i~ determines the different conflict resolution schemes: Linear Probing (d~i~ = i), Quadratic Probing (d~i~ = ± i^2^), Double Hashing (d~i~ = i*h~2~(key)).

① Linear Probing Method#

Use the increment sequence 1, 2, …, (TableSize - 1) to cyclically probe the next storage address.

Insert image description here

② Quadratic Probing Method (Secondary Probing)#

Use the increment sequence 1^2^, -1^2^, 2^2^, -2^2^, …, q^2^, -q^2^ and q ≤ | TableSize/2 | to cyclically probe the next storage address.
Insert image description here

Can quadratic probing find all spaces? There is a theorem as follows:

If the length of the hash table TableSize is a certain 4k+3 (k is a positive integer) form prime number, then quadratic probing can probe the entire hash table space.

// Find Key element, here using quadratic probing to handle conflicts
Index Find(ptrHash H, DataType Key) {
    Index nowp, newp;
    int Cnum = 0; // Record the number of collisions
    newp = nowp = Hash(Key, H->TableSize);
    // If the unit at this position is not empty and is not the element to be found, a collision occurs
    while(H->Units[newp].flag != Empty && H->Units[newp].data != Key) {
        ++Cnum; // Increase collision count by one
        if(++Cnum % 2) {
            newp = nowp + (Cnum+1)*(Cnum+1)/4; // Increment is +i^2, i is (Cnum+1)/2
            if(newp >= H->TableSize)
                newp = newp % H->TableSize;
        } else {
            newp = nowp - Cnum*Cnum/4; // Increment is -i^2, i is Cnum/2
            while(newp < 0)
                newp += H->TableSize;
        }
    }
    return newp; // Return position, if this position is an empty unit, it means not found
}

③ Double Hashing Method#

d~i~ is i*h~2~(key), h~2~(key) is another hash function, the probing sequence is h~2~(key), 2h~2~(key), 3h~2~(key)…

For any key, h~2~(key) ≠ 0!
The probing sequence should also ensure that all hash storage units can be probed, the following form has good effects: h~2~(key) = p - (key mod p)
where p < TableSize, p and TableSize are both prime numbers.

Separate Chaining Method#

The separate chaining method stores all conflicting keywords at the corresponding position in a singly linked list. The advantage is that the deletion of keywords does not require "lazy deletion," thus avoiding storage garbage; the disadvantage is that the storage efficiency and search efficiency of the linked list part are relatively low. The uneven length of the linked list can lead to a significant decrease in time efficiency.
Insert image description here

(4) Operation Set of Hash Search (Code)#

Here is the hash search for strings, using the shift method for the Hash function and the quadratic probing method for conflict resolution; others can be modified based on this.

Definition#

const int MaxSize = 100000;
typedef int Index; // Hash index
typedef string DataType; // Element type stored in hash unit
// Hash unit state type, corresponding to: valid element, empty unit, deleted element
typedef enum {Legitimate, Empty, Deleted} EntryType;
struct HashNode {   // Hash table unit type
    DataType data;      // Store element
    EntryType flag;     // Unit state
};
struct HashTable {  // Hash table type
    int TableSize;      // Table length
    HashNode *Units;    // Array storing hash units
};
typedef HashTable *ptrHash;

Get Prime Number#

Return the smallest prime number greater than N and not exceeding MaxSize, to ensure that the maximum length of the hash table is prime, thus making the address space corresponding to keywords as uniform as possible.

// Return the smallest prime number greater than N and not exceeding MaxSize, to ensure that the maximum length of the hash table is prime
int NextPrime(int N) {
    int i, p = (N%2) ? N+2 : N+1; // Start from the next odd number p greater than N
    while(p <= MaxSize) {
        for(i = (int)sqrt(p); i >= 2; i--) 
            if(!(p % i)) break; // Not a prime number
        if(i == 2) break; // for ends normally, it is a prime number
        else p += 2; // Try the next odd number
    }
    return p;
}

Create Empty Table#

Create an empty table with a length greater than TableSize. (Ensure the length is prime)

ptrHash CreateTable(int TableSize) {
    ptrHash H;
    int i;
    H = new HashTable;
    H->TableSize = NextPrime(TableSize);
    H->Units = new HashNode[H->TableSize];
    for(int i = 0; i < H->TableSize; ++i) 
        H->Units[i].flag = Empty;
    return H;
}

Hash Calculation#

Return the index after calculation by the hash function. Here the keyword data type is string, using the shift method for hashing.

Index Hash(DataType Key, int TableSize) { 
    unsigned int h = 0; // Hash function value, initialized to 0
    string str = Key.str;
    int len = str.length();
    for(int i = 0; i < len; ++i) 
        h = (h << 5) + str[i];
    return h % TableSize;
}

Search Operation#

Find the Key element, here using quadratic probing to handle conflicts, returning the position index; if this position is an empty unit, it indicates not found.

Index Find(ptrHash H, DataType Key) {
    Index nowp, newp;
    int Cnum = 0; // Record the number of collisions
    newp = nowp = Hash(Key, H->TableSize);
    // If the unit at this position is not empty and is not the element to be found, a collision occurs
    while(H->Units[newp].flag != Empty && H->Units[newp].data != Key) {
        ++Cnum; // Increase collision count by one
        if(++Cnum % 2) {
            newp = nowp + (Cnum+1)*(Cnum+1)/4; // Increment is +i^2, i is (Cnum+1)/2
            if(newp >= H->TableSize)
                newp = newp % H->TableSize;
        } else {
            newp = nowp - Cnum*Cnum/4; // Increment is -i^2, i is Cnum/2
            while(newp < 0)
                newp += H->TableSize;
        }
    }
    return newp; // Return position, if this position is an empty unit, it means not found
}

Insert Operation#

Insert the Key into the table, returning success or failure; failure indicates that the key already exists.

bool Insert(ptrHash H, DataType Key) {
    Index p = Find(H, Key);
    if(H->Units[p].flag != Legitimate) { // This position can insert an element
        H->Units[p].flag = Legitimate;
        H->Units[p].data = Key;
        // Other operations
        return true;
    } else { // The key already exists
        // Other operations
        return false;
    }
}

(5) Performance Analysis#

The search efficiency of hash tables is analyzed through the following factors:

Average Successful Search Length (ASLs)
Average Unsuccessful Search Length (ASLu)

Taking the example in linear probing method, the performance of this hash table is analyzed as follows:

Insert image description here

Its ASLs is the average number of comparisons for keywords in the search table (i.e., the number of collisions plus 1)
ASLs = (1+7+1+1+2+1+4+2+4)/9 = 23/9 ≈ 2.56
Its ASLu is the average number of comparisons for keywords not in the hash table (unsuccessful).
General method: Classify keywords not in the hash table into several categories, such as by their H(key) value; here H(key) = key mod 11, analyze in 11 categories, ASLu as follows:
ASLu = (3+2+1+2+1+1+1+9+8+7+6)/11 = 41/11 ≈ 3.73

The number of comparisons for keywords depends on the number of collisions, and the factors affecting the number of collisions are as follows:

Whether the hash function is uniform.
The method of handling conflicts.
The load factor α of the hash table.
The impact of different conflict handling methods and load factors on efficiency is as follows:

① Search Performance of Linear Probing Method#

Insert image description here

② Search Performance of Quadratic Probing and Double Hashing Methods#

Insert image description here
When the load factor α < 0.5, the expected number of probes for various probing methods is not large and relatively close. As α increases, the expected number of probes for the linear probing method increases rapidly, and the expected number of probes for unsuccessful searches and insert operations is greater than that for successful searches. Therefore, the reasonable maximum load factor should not exceed 0.85.

③ Search Performance of Separate Chaining Method#

Insert image description here

Summary#

Choosing a suitable h(key), the expected search efficiency of the hashing method is constant O(1), which is almost independent of the size n of the keyword space! It is also suitable for problems where direct comparison of keywords is computationally intensive.
It is based on a small α, thus the hashing method is a space-for-time method.
The storage of the hashing method is random for keywords, not convenient for sequential search of keywords, and also not suitable for range searches or maximum and minimum value searches.