## 1. Hash Table#

## 1. Introduction to Concepts#

Management of variables in compilation: **Dynamic Search Problem**

- Using search trees? — Comparing two variable names (strings)
**is not efficient** **Convert strings to numbers, then process?**— This is the idea of hash search.

Known search methods:- Sequential Search O(N)
- Binary Search (Static Search) O(log~2~N)
- Binary Search Tree O(h) where h is the height of the binary search tree
- Balanced Binary Tree O(log~2~N)

##### Question: How to quickly search for the required keywords? What if the keywords are not easy to compare?#

The essence of searching: Finding the position of known objects

- Ordered arrangement of objects: total order (completely ordered, e.g., binary search), partial order (some keywords have an order, e.g., search tree)
- Directly
**"calculate"**the position of the object:**Hashing**

### Basic Work of Hashing#

**Calculate position: Construct a hash function**to determine the storage position of keywords.**Resolve conflicts: Apply some strategy to resolve the issue of multiple keywords having the same position.**

### Time Complexity#

The time complexity is almost constant: O(1), meaning the search time is independent of the problem size!

## 2. Hash Search#

### (1) Basic Idea#

- Use the keyword
**key**as the independent variable, and through a determined function h**(hash function)**, calculate the corresponding function value h(key) as the storage address of the data object. - Different keywords may map to the same hash address, i.e., h(key~i~) = h(key~j~) (when key~i~ ≠ key~j~), which is called
**"Collision."**— Requires some conflict resolution strategy.

In the case of no conflict, searching only requires calculating the address using the hash function.

#### Load Factor#

Load Factor (Loading Factor): Let the size of the hash table be m, and the number of elements filled in the table be n, then α = n / m is called the load factor of the hash table.

- As shown in the figure,
**α = 11 / 17 ≈ 0.65**

If there is no overflow, then T~query~ = T~insert~ = T~delete~ = O(1)

#### Rehashing#

- When the number of elements in the hash table is too large (i.e., the load factor α is too high), the search efficiency will decrease.
- The practical maximum load factor is generally taken as
**0.5 ≤ α ≤ 0.85**

- The practical maximum load factor is generally taken as
- The solution is to double the size of the hash table, a process called
**"Rehashing."**

### (2) Construction of Hash Functions#

#### Two Factors#

A "good" hash function generally considers the following two factors:

**Simple calculation**to improve conversion speed.**Uniform distribution of address space corresponding to keywords**to minimize collisions.

#### Hash Function Construction for Numeric Keywords#

##### ① Direct Addressing Method#

Take a certain linear function value of the keyword as the hash address, i.e., **h(key) = a × key + b** (a, b are constants).

##### ② Division Remainder Method (Commonly Used)#

The hash function is: **h(key) = key mod p**

e.g., h(key) = key %17

Generally, to ensure that the address space corresponding to keywords is distributed uniformly, **p is taken as a prime number.**

##### ③ Digit Analysis Method#

By analyzing the variation of numeric keywords at each digit, take **comparatively random digits** as the hash address.

##### ④ Folding Method#

Split the keyword into several parts of the same digit, then add them together.

##### ⑤ Square Middle Method#

#### Hash Function Construction for Character Keywords#

##### Simple Hash Function — ASCII Code Addition#

Define the hash function for character-type keywords key as follows: **h(key) = (Σkey[i]) mod TableSize**

**Severe collisions!!**

##### Simple Improvement — First 3 Characters Shift Method#

**h(key) = (key[0] × 27^2^ + key[1] ×27 + key[2]) mod TableSize**

**Still collisions, space waste.**

##### Good Hash Function — Shift Method#

Design all n characters of the keyword, and distribute them well.

h(key) = ($\begin{matrix} \sum_{i=0}^{n-1}key[n-i-1] \times32^i \end{matrix}$) mod TableSize

The code is as follows:

```
typedef string DataType; // Data type
typedef int Index; // Hash index
// Returns the index after calculation by the hash function
Index Hash(string Key, int TableSize) {
unsigned int h = 0; // Hash function value, initialized to 0
int len = Key.length();
for(int i = 0; i < len; ++i)
h = (h << 5) + Key[i];
return h % TableSize;
}
```

### (3) Conflict Handling in Hash Search#

Common strategies for handling conflicts are as follows:

- Change position —
**Open Addressing Method** - Organize conflicting objects at the same position together —
**Chain Addressing Method**

#### Open Addressing Method#

Open Addressing Method, once a **collision** occurs (the position already has other elements), it searches for **another empty address** according to some rules. Its advantage is that the hash table is an array, storage efficiency is high, random search; the disadvantage is that the hash table has a "clustering" phenomenon. In the open addressing hash method, the delete operation needs to be very careful, it can only be "lazily deleted," meaning a deletion marker needs to be added instead of a real deletion, so that it won't "break the chain" during searching. Its space can be reused during the next insertion.

- If a
**i-th collision**occurs, the next address to probe will**increase d~i~**, the basic formula is:**h~i~(key) = (h(key) + d~i~) mode TableSize(1 ≤ i < TableSize)** **d~i~**determines the different conflict resolution schemes:**Linear Probing (d~i~ = i), Quadratic Probing (d~i~ = ± i^2^), Double Hashing (d~i~ = i*h~2~(key)).**

#### ① Linear Probing Method#

Use the increment sequence 1, 2, …, (TableSize - 1) to cyclically probe the next storage address.

#### ② Quadratic Probing Method (Secondary Probing)#

Use the increment sequence 1^2^, -1^2^, 2^2^, -2^2^, …, q^2^, -q^2^ and q ≤ | TableSize/2 | to cyclically probe the next storage address.

Can quadratic probing find all spaces? There is a theorem as follows:

- If the length of the hash table TableSize is a certain
**4k+3**(k is a positive integer) form**prime number**, then**quadratic probing can probe the entire hash table space.**

```
// Find Key element, here using quadratic probing to handle conflicts
Index Find(ptrHash H, DataType Key) {
Index nowp, newp;
int Cnum = 0; // Record the number of collisions
newp = nowp = Hash(Key, H->TableSize);
// If the unit at this position is not empty and is not the element to be found, a collision occurs
while(H->Units[newp].flag != Empty && H->Units[newp].data != Key) {
++Cnum; // Increase collision count by one
if(++Cnum % 2) {
newp = nowp + (Cnum+1)*(Cnum+1)/4; // Increment is +i^2, i is (Cnum+1)/2
if(newp >= H->TableSize)
newp = newp % H->TableSize;
} else {
newp = nowp - Cnum*Cnum/4; // Increment is -i^2, i is Cnum/2
while(newp < 0)
newp += H->TableSize;
}
}
return newp; // Return position, if this position is an empty unit, it means not found
}
```

#### ③ Double Hashing Method#

**d~i~ is i*h~2~(key)**, **h~2~(key) is another hash function**, the probing sequence is h~2~(key), 2h~2~(key), 3h~2~(key)…

- For any key, h~2~(key) ≠ 0!
- The probing sequence should also ensure that all hash storage units can be probed, the following form has good effects:
**h~2~(key) = p - (key mod p)**

where p < TableSize, p and TableSize are both prime numbers.

#### Separate Chaining Method#

The separate chaining method stores all conflicting keywords at the corresponding position **in a singly linked list**. The advantage is that the deletion of keywords does not require "lazy deletion," thus avoiding storage garbage; the disadvantage is that the storage efficiency and search efficiency of the linked list part are relatively low. The uneven length of the linked list can lead to a significant decrease in time efficiency.

### (4) Operation Set of Hash Search (Code)#

Here is the hash search for strings, using the shift method for the Hash function and the quadratic probing method for conflict resolution; others can be modified based on this.

#### Definition#

```
const int MaxSize = 100000;
typedef int Index; // Hash index
typedef string DataType; // Element type stored in hash unit
// Hash unit state type, corresponding to: valid element, empty unit, deleted element
typedef enum {Legitimate, Empty, Deleted} EntryType;
struct HashNode { // Hash table unit type
DataType data; // Store element
EntryType flag; // Unit state
};
struct HashTable { // Hash table type
int TableSize; // Table length
HashNode *Units; // Array storing hash units
};
typedef HashTable *ptrHash;
```

#### Get Prime Number#

Return the smallest prime number greater than N and not exceeding MaxSize, to ensure that the maximum length of the hash table is prime, thus making the address space corresponding to keywords as uniform as possible.

```
// Return the smallest prime number greater than N and not exceeding MaxSize, to ensure that the maximum length of the hash table is prime
int NextPrime(int N) {
int i, p = (N%2) ? N+2 : N+1; // Start from the next odd number p greater than N
while(p <= MaxSize) {
for(i = (int)sqrt(p); i >= 2; i--)
if(!(p % i)) break; // Not a prime number
if(i == 2) break; // for ends normally, it is a prime number
else p += 2; // Try the next odd number
}
return p;
}
```

#### Create Empty Table#

Create an empty table with a length greater than TableSize. (Ensure the length is prime)

```
ptrHash CreateTable(int TableSize) {
ptrHash H;
int i;
H = new HashTable;
H->TableSize = NextPrime(TableSize);
H->Units = new HashNode[H->TableSize];
for(int i = 0; i < H->TableSize; ++i)
H->Units[i].flag = Empty;
return H;
}
```

#### Hash Calculation#

Return the index after calculation by the hash function. Here the keyword data type is string, using the shift method for hashing.

```
Index Hash(DataType Key, int TableSize) {
unsigned int h = 0; // Hash function value, initialized to 0
string str = Key.str;
int len = str.length();
for(int i = 0; i < len; ++i)
h = (h << 5) + str[i];
return h % TableSize;
}
```

#### Search Operation#

Find the Key element, here using quadratic probing to handle conflicts, returning the position index; if this position is an empty unit, it indicates not found.

```
Index Find(ptrHash H, DataType Key) {
Index nowp, newp;
int Cnum = 0; // Record the number of collisions
newp = nowp = Hash(Key, H->TableSize);
// If the unit at this position is not empty and is not the element to be found, a collision occurs
while(H->Units[newp].flag != Empty && H->Units[newp].data != Key) {
++Cnum; // Increase collision count by one
if(++Cnum % 2) {
newp = nowp + (Cnum+1)*(Cnum+1)/4; // Increment is +i^2, i is (Cnum+1)/2
if(newp >= H->TableSize)
newp = newp % H->TableSize;
} else {
newp = nowp - Cnum*Cnum/4; // Increment is -i^2, i is Cnum/2
while(newp < 0)
newp += H->TableSize;
}
}
return newp; // Return position, if this position is an empty unit, it means not found
}
```

#### Insert Operation#

Insert the Key into the table, returning success or failure; failure indicates that the key already exists.

```
bool Insert(ptrHash H, DataType Key) {
Index p = Find(H, Key);
if(H->Units[p].flag != Legitimate) { // This position can insert an element
H->Units[p].flag = Legitimate;
H->Units[p].data = Key;
// Other operations
return true;
} else { // The key already exists
// Other operations
return false;
}
}
```

### (5) Performance Analysis#

The search efficiency of hash tables is analyzed through the following factors:

- Average Successful Search Length (ASLs)
- Average Unsuccessful Search Length (ASLu)

Taking the example in linear probing method, the performance of this hash table is analyzed as follows:

- Its
**ASLs**is the**average number of comparisons for keywords in the search table (i.e., the number of collisions plus 1)**

ASLs = (1+7+1+1+2+1+4+2+4)/9 = 23/9 ≈ 2.56 - Its
**ASLu**is the average number of comparisons for keywords not in the hash table (unsuccessful).

General method: Classify keywords not in the hash table into several categories, such as by their H(key) value; here H(key) = key mod 11, analyze in 11 categories, ASLu as follows:

ASLu = (3+2+1+2+1+1+1+9+8+7+6)/11 = 41/11 ≈ 3.73

The number of comparisons for keywords depends on the number of collisions, and the factors affecting the number of collisions are as follows:

- Whether the hash function is uniform.
- The method of handling conflicts.
- The load factor α of the hash table.

The impact of different conflict handling methods and load factors on efficiency is as follows:

#### ① Search Performance of Linear Probing Method#

#### ② Search Performance of Quadratic Probing and Double Hashing Methods#

When the load factor α < 0.5, the expected number of probes for various probing methods is not large and relatively close. As α increases, the expected number of probes for the linear probing method increases rapidly, and the expected number of probes for unsuccessful searches and insert operations is greater than that for successful searches. Therefore, **the reasonable maximum load factor should not exceed 0.85.**

#### ③ Search Performance of Separate Chaining Method#

#### Summary#

- Choosing a
**suitable h(key)**, the expected search efficiency of the hashing method is constant O(1), which is almost independent of the size n of the keyword space! It is also suitable for problems where direct comparison of keywords is computationally intensive. - It is based on a
**small α**, thus the hashing method is a**space-for-time**method. - The
**storage**of the hashing method is**random**for keywords,**not convenient for sequential search**of keywords, and also**not suitable for range searches**or**maximum and minimum value**searches.