Collisions in hash tables and how to resolve them

Click Here to view videos related to this article.
- X
Collisions in Hashing - Rob Edwards
๐Ÿ”—
Click here to suggest a better video or view suggestions..

Hash tables are incredibly useful data structures that allow us to store and retrieve information very quickly. They use a special function called a hash function to assign each piece of data (identified by a unique key) to a specific slot or index in an array. Ideally, every key gets its own unique slot.

However, what happens when the hash function assigns the same slot to two different keys? This situation is called a collision, and it’s a common challenge when working with hash tables. Understanding collisions and how to handle them is crucial for keeping hash tables efficient.

What is a Hash Collision?

Imagine you have a set of numbered lockers (the hash table slots) and you need to assign lockers to students (the keys). You have a simple rule (the hash function) to decide which locker each student gets based on their name. For example, maybe you use the first letter of their name to assign a locker number.

Now, what if two students, say “Alice” and “Alex,” both need a locker? According to your rule, both their names start with ‘A’, so they might be assigned the same locker number. This is exactly what a hash collision is: when two or more different keys produce the same hash value, leading them to be mapped to the same index in the hash table.

  Key      Hash Function      Index
"Alice"  --> hash("Alice")  -->  3
"Bob"    --> hash("Bob")    -->  1
"Charlie"--> hash("Charlie")-->  4
"Alex"   --> hash("Alex")   -->  3  <-- COLLISION! Both "Alice" and "Alex" map to index 3.

Collisions mean we can’t simply place the new item directly into the calculated slot because it’s already occupied by something else. We need a strategy to manage this.

Why Do Collisions Happen?

Collisions are almost unavoidable in most real-world scenarios for a couple of main reasons:

  1. Limited Slots: Hash tables typically have a fixed number of slots (the array size). If you have more keys than available slots, the Pigeonhole Principle dictates that at least one slot must hold more than one key. Even if you have fewer keys than slots, collisions can still occur.
  2. Imperfect Hash Functions: Ideally, a hash function would spread keys perfectly evenly across all available slots. However, creating such a perfect hash function is very difficult (and sometimes impossible if you don’t know all the keys beforehand). Most hash functions will inevitably map some different keys to the same index.

Note

Even with a good hash function that distributes keys well, the possibility of collisions increases as the hash table fills up (i.e., as the load factor increases).

How to Resolve Collisions?

Since collisions are expected, hash tables need mechanisms to handle them gracefully. The goal is to still be able to store and retrieve all keys, even if they initially hash to the same index. There are two primary approaches to collision resolution:

  1. Separate Chaining:

    • Idea: Instead of storing just one item at each slot, each slot holds a pointer to another data structure, usually a linked list (or sometimes a balanced tree).
    • How it works: When a collision occurs at an index, the new key-value pair is simply added to the list (or other structure) at that index. To find an element, you hash the key to find the correct slot, and then search through the list at that slot. jsondsav
  2. Open Addressing:

    • Idea: All key-value pairs are stored directly within the hash table array itself.
    • How it works: When a collision occurs at an index, the algorithm probes (checks) other slots in the table according to a predetermined sequence until an empty slot is found. The new key-value pair is then inserted into that empty slot. Different probing sequences exist, such as:
      • Linear Probing: Check the next slot sequentially (index + 1, index + 2, etc.).
      • Quadratic Probing: Check slots based on a quadratic sequence (index + 1^2, index + 2^2, etc.).
      • Double Hashing: Use a second hash function to determine the step size for probing. jsondsav

Tip

The choice between Separate Chaining and Open Addressing depends on various factors, including memory usage patterns, the expected load factor, and the importance of deletion operations (which can be simpler with Separate Chaining).

What’s Next?

Now that you understand what collisions are and why they happen, the next step is to explore the practical methods used to manage them. Discover how different techniques ensure hash tables remain efficient even when multiple keys map to the same slot:

  • Separate Chaining: Learn how collisions are handled by creating lists (or chains) of items that hash to the same slot.
  • Open Addressing: Discover techniques where colliding items are placed in other empty slots within the table itself.
    • Linear Probing: Understand the simplest open addressing method where we check consecutive slots.
    • Quadratic Probing: Explore how checking slots using a quadratic sequence can help reduce clustering.
    • Double Hashing: See how using a second hash function improves the distribution of probed slots.
  • Comparing Techniques: Analyze the trade-offs between separate chaining and various open addressing methods.