Prefix Tree (Trie) Data Structure

Intro to Trie's
🔗
Basics of Trie Data Structure
🔗
Lecture on Trie Data Structure
🔗

A trie, also called as prefix search tree, is a data structure used to store a set of words or strings. It can be used to efficiently store a large number of words and quickly find if a particular word/string is present in the trie.

Tries or prefix trees can also find all the stored words for a particular prefix efficiently. For example if the words [“cat”, “cup”, “cast”] are stored in the trie, given a prefix string like “ca”, the trie can quickly find and return all the words with that start with “ca”: [“cat”, “cast”]. This is especially useful for implementing auto completion features similar to google search.

Note: is any word that is made of one or more consecutive characters from the start of the reference word. For example, prefix words of the word “top” are “t”, “to” and “top”.

Anatomy of a Trie Data Structure

A Trie or prefix tree is a rooted tree made up of multiple nodes. Each node contains zero or more links where each link correspond to a unique character. Words are stored in the trie by forming links corresponding to each character in the word. Let us consider a simple scenario where only the word “car” is stored in the trie. Below is how the trie looks:

In the above example trie, the root node “*” has a link to character node “c”. Character node “c” in turn has a link to character node “a” and node “a” in turn has a link to character node “r”. So if we traverse from the root node to the leaf and note down the characters, we get the word “car” (c->a->r).

Now let’s add the word “cat” to the previous trie structure. Since the nodes corresponding to the prefix c -> a already exist in the trie, the same nodes & links are re-used and an extra link to character “t” is added after character “a”. Below is how the trie looks after adding the word “cat”.

So now we can traverse the trie/prefix tree from root node to leaf nodes via two paths: c ->a->r which corresponds to the word “car” and c->a->t which corresponds to the word “cat”.

Now let’s try adding a word that starts with a word already added to the trie. For example, let’s add the word “card” which has a prefix “car” in it. (The word “car” is already stored in the trie.)

The paths from root node to leaf (end) node in the above trie are c->a->p corresponding to the word “cap”, c->a->r->d corresponding to the word “card”, c->a->t corresponding to the word “cat”. But how do we confirm that the word “car” exists in the trie? In order to achieve this, we add a marker at each node corresponding to ending character of the inserted word. Below is how the trie looks after marking the word endings:

So now all the valid paths from root node to any marked node in the above trie are: c->a->p, c->a->r, c->a->r->d, c->a->t corresponding to the words “cap”, “car”, “card”, “cat” respectively.

Summary: Tries are made up of multiple nodes and each node can link to zero or more unique characters. The words stored in the trie are found by travelling from the root node to any node which has a mark indicating that a valid word ends at that node.

TODO: Interative words inserter & Fun exercise.

Operations on a Trie Data Structure

Add a word to the trie

This operation adds/inserts any given word into the trie. Below is how we insert any word into the trie’s prefix tree:

  • Start from the first character in the given word and move towards the last character
  • For each character, travel one step down the tree from current node to the node corresponding to current character
  • If the link corresponding to a character does not exist for a node, create it and travel down
  • Once we reach the node corresponding to last character in the word, mark the node as a valid word ending

The above algorithm might be difficult to understand at first. In order to make it easier to understand, let’s start with an empty trie and look at the animation of how the animation of how insertion of a couple of words into the trie look like.

Let’s add the word “cat” first. No sequence of nodes corresponding to any prefix of the word “cat” exist in the trie. So we keep inserting links character by character and move down the trie till the last character “t” and then mark it.

Now let’s add the word “dog” to the trie. As with the previous word, no common prefix nodes exist for the word “dog”, so we keep inserting links character by character and move down the trie and then mark the last character “g”.

Now let’s add the word “cart” to the trie. For this word “cart”, the nodes corresponding to the prefix “ca” already exist in the trie. So we reuse them and add links to the last missing characters “r” and “t” to the sequence of links and then mark the last character.

Now let’s add the word “car” to the trie. For this word “car”, the sequence of nodes corresponding to all the characters exist in the trie. So we travel via the nodes “c->a->r” and since the node for the last character “r” is not marked, we mark it.

Find if a word is present in the Trie

This operation searches the trie for existence of a given word. Below is the algorithm used to search for a word:

  • Start from the first character in the given word and move towards the last character
  • Start from the root of the prefix tree and travel down the tree by following the nodes corresponding to each character
  • If a particular node corresponding to a character is missing, it means the word is not present in the trie
  • If we are able to visit nodes corresponding to all the characters and last node has a mark on it, it means that the word is present in the trie.

The above algorithm might be difficult to understand at first. In order to make it easier to understand, let’s look at the animation of searching for three cases.

First let’s search for the word “cab”. While traversing the prefix tree starting from the root node, we are able to travel through the links “c -> a”, but the link to “b” is missing. So we conclude that the word “cab” is not present in the trie.

Now let’s search for the word “car”. While traversing the prefix tree starting from the root node, we are able to travel through all the links “c -> a -> r” but the node corresponding to last character is not marked as a valid word end. So we conclude that the word “car” is not present in the trie.

Let’s search for the word “card” now. While travarsing the prefix tree starting from the root node, we are able to travel through all the links c -> a -> r -> d. Also the node corresponding to the last character “d” is marked as a valid word end. So we conclude that the word “card” is present in the trie.

Delete a word from the Trie

This operation deletes a word from the trie. The algorithm to delete a word from the trie is similar to the search algorithm. First we search for the word to delete; if the word is present, we remove the mark on the node corresponding to the last character. Also the nodes which are used only for the deleted word are also removed from the trie.

TODO: Animation of deleting a word from the trie.

List all words that start with a word (prefix)

This operation lists all the words present in the trie that start with a particular prefix. This is achieved by travelling down the tree following all the characters in the prefix word and then traversing through all the possible paths down the tree. For example given the below prefix tree, for “ca”, all the possible paths are: c->a->r->d & c->a->s->t

Implementing Trie Data Structure


// Implementation of Trie Data Structure in C++
#include <iostream>
#include <string>
#include <unordered_map>
#include <memory>
#include <vector>

struct TrieNode {
  TrieNode(){}
  bool validEnd = false;
  std::unordered_map<char, std::unique_ptr<TrieNode>> childnodes;
  
  void insert_word(const std::string &word){
    TrieNode *cur_node = this;
    for(char c : word){
      if(cur_node->childnodes.find(c) == cur_node->childnodes.end()){
        cur_node->childnodes[c] = std::make_unique<TrieNode>();
      }
      cur_node = cur_node->childnodes[c].get();
    }
    cur_node->validEnd = true;
  }
  
  bool find_word(const std::string &word){
    TrieNode *cur_node = this;
    for(char c : word){
      if(cur_node->childnodes.find(c) == cur_node->childnodes.end()){
        return false;
      }
      cur_node = cur_node->childnodes[c].get();
    }
    return cur_node->validEnd == true;
  }
  
  void delete_word(const std::string &word){
    TrieNode *cur_node = this;
    for(char c : word){
      if(cur_node->childnodes.find(c) == cur_node->childnodes.end()){
        return; // We fell off the tree
      }
      cur_node = cur_node->childnodes[c].get();
    }
    cur_node->validEnd = false;
  }
};


// Let's now fill the trie with some words
// And test the operations
int main(){
  TrieNode trie;
  std::vector<std::string> words_to_insert = {
    "car",
    "bike",
    "cart",
    "plane"
  };
  for(auto &word : words_to_insert){
    trie.insert_word(word);
  }
  
  // Let's search for some words
  std::vector<std::string> words_to_search = {
    "ship",
    "plane",
    "car",
    "plan"
  };
  
  for(auto &word : words_to_search){
    std::cout << "The word " << word << " ";
    if(trie.find_word(word)){
      std::cout << "is found!";
    }
    else{
      std::cout << "is not found.";
    }
    std::cout << std::endl;
  }
  
  // Let's delete a word and see if it works
  std::cout << "Deleting the word plane.." << std::endl;
  trie.delete_word("plane");
  std::cout << "The word plane is ";
  if(trie.find_word("plane")) std::cout << "found!" << std::endl;
  else std::cout << "not found." << std::endl;
  
  return 0;
}

Applications of Trie / Prefix Search Tree

  • Tries are used to quickly find all the words that start with a particular prefix. For example, while finding contacts in your phone, just by typing the first few characters, all the possible numbers are retrieved
  • They are used in search engines to suggest phrases that start with a particular pattern
  • Used to quickly find if any given word is present in a text within time proportional to the size of word to search
  • Tries are also used as an alternative to hash tables by storing values at nodes corresponding to each word end
  • In compression algorithms

References