Tarjan's Strongly Connected Components Algorithm

Tarjan's SCC Algorithm By William Fiset
ðŸ”—

Tarjan’s algorithm is a popular and efficient algorithm for finding strongly connected components (SCCs) in a directed graph. The purpose of this algorithm is to efficiently identify the subsets/groups of vertices in a directed graph where each vertex in a subset is reachable from every other vertex within the same subset.

What are Strongly Connected Components?

Strongly connected components (SCCs) is a concept in graph theory used to describe subgraphs of a directed graph where every vertex in a subgraph is reachable from every other vertex within the same subgraph. In simpler terms, an SCC is a portion of the graph where you can travel between any two nodes in both directions without leaving the component. These components are important because they help to identify sections of a graph that are tightly interconnected, which can be useful for understanding the structure and behavior of networks, such as social networks or web pages.

Below is an illustration of a graph with each of its strongly connected components labeled:

Detailed Steps of Tarjan’s SCC Algorithm

Tarjan’s algorithm performs a depth-first search (DFS) on the graph and keeps track of two important values for each vertex:

• Discovery Index: A unique integer assigned to each vertex in the order it is first visited during the DFS traversal.
• Low Link Value: The minimum discovery index of any vertex reachable from the current vertex through a path of edges in the graph.

Consider the graph shown below to understand the concept of discovery index of a node and low link value of a node.

Let’s say we perform the DFS starting from node `A`. This is one of the order in which nodes are visited/discovered: `A`, `B`, `C`, `D`. Since node `A` is discovered first, it is assigned an index `0`. Next node `B` is visited, so it is assigned the next index `1` and for node `C` the discovery index will be `2` and so on. Below is the graph with discovery indices (`D-Index`) of each node marked:

Now, low link value of each node is the lowest discovery index of any node reachable from that node. For example, if you consider node `B`, it can reach node `D` with discovery index `3`, node `C` with discovery index `2` and node `A` with discovery index `0`. Since `0` is the lowest index among all reachable nodes from node `B`, it’s low link value is `0`. Below is how the graph looks like after low link values (`LL-Value`) for all the nodes are marked:

With an understanding of what discovery indices and low link values are, let’s examine the steps in Tarjan’s algorithm:

Step-1: Initialization:

• Set up a counter for assigning discovery indices, starting from zero.
• Initialize discovery indices and low link values for all vertices to undefined (or a special value like -1 to indicate they are unvisited).
• Prepare a stack to keep track of vertices that are still not identified as part of any SCC.
• Initialize a boolean map/set to keep track of whether a vertex is currently on the stack.

Step-2: Depth-First Search Traversal:

• For each vertex in the graph:
• If the vertex is unvisited:
• Call the recursive DFS function on the vertex.

Recursive DFS Function:

• Assign the current counter value as both the discovery index and the initial low link value of the vertex.
• Increment the counter.
• Push the vertex onto the stack.
• Mark the vertex as being on the stack.
• For each vertex directly connected (adjacent) to the current vertex:
• If the adjacent vertex is unvisited, recursively apply the DFS function to it.
• After returning, update the current vertex’s low link value:
`low_link[v] = min(low_link[v], low_link[adj_node])`
• If the adjacent vertex is visited and is on the stack:
• Update the current vertex’s low link value: `low_link[v] = min(low_link[v], low_link, index[adj_node])`
• After exploring all adjacent vertices, check for SCC formation:
• If the vertex’s low link value equals its discovery index:
• Initalize a new SCC.
• Pop vertices from the stack and add it to the new SCC until the current vertex is reached.
• Mark each popped vertex as no longer being on the stack.
• The popped vertices form a strongly connected component.

The algorithm efficiently identifies SCCs in a single depth-first search pass, making it particularly useful for large, complex graphs where performance is crucial.

Implementation

``````
def dfs(graph, node, counter, index, low_link, stack, on_stack, sccs):
"""
Perform Depth-First Search (DFS) to find Strongly Connected Components (SCCs)

Args:
graph: Adjacency list representation of the graph
node: Current node being processed
counter: Counter for assigning discovery indices
index: Mapping of node to its index in the DFS
stack: Stack to keep track of nodes in the DFS search tree
on_stack: Set of nodes currently on the stack
sccs: List to store the found Strongly Connected Components
"""
index[node] = counter
counter += 1
stack.append(node)

# Iterate over all neighbors of the current node
for neighbor in graph[node]:
if neighbor not in index:
# If the neighbor hasn't been visited, perfom dfs on it's subtree
dfs(graph, neighbor, counter, index, low_link, stack, on_stack, sccs)
elif neighbor in on_stack:

# Check if the current node is a root node for an SCC
scc = []
# Pop all nodes from the stack upto
# the root node and add them to current SCC
while stack:
top = stack.pop()
on_stack.remove(top)
scc.append(top)
if top == node:
break
sccs.append(scc)

def tarjan_scc(graph):
"""
Find the Strongly Connected Components (SCCs) in the given graph using Tarjan's algorithm.

Args:
graph: Adjacency list representation of the graph

Returns:
list: List of Strongly Connected Components (SCCs)
"""
counter = 0 # Counter to set discovery indices
index = {} # Map to store discovery index of each node
low_link = {} # Map to store low-link value of each node
stack = [] # Stack to store nodes which yet to be mapped to an SCC
on_stack = set() # Set to store nodes which are present on stack
sccs = [] # List to store all SCCs in the given graph

for node in graph:
if node not in index:
# Current node is not visited yet
# So let's perform a DFS and group nodes into SCC's
dfs(graph, node, counter, index, low_link, stack, on_stack, sccs)

return sccs

graph = {
"A" : ["D", "B"],
"B" : ["C"],
"C" : ["A"],
"D" : ["I"],
"E" : ["D", "F"],
"F" : ["G"],
"G" : ["H"],
"H" : ["E"],
"I" : ["J"],
"J" : ["I"]
}

print("List of all SCCs in the graph:")
sccs = tarjan_scc(graph)
print(sccs)
``````

Time and Space Complexity

Time Complexity

The time complexity of Tarjan’s algorithm is `O(V + E)`, where `V` is the number of vertices (nodes) and `E` is the number of edges in the graph. Below is the breakup of how we arrived at this time complexity:

• DFS Traversal: The algorithm performs a depth-first search on the graph. Each node is visited exactly once, and each edge is explored once. The DFS itself takes `O(V + E)` time.

• Index and Low-Link Calculations: For each node, the algorithm assigns an index and a low-link value. These operations are constant time operations and it takes a total of `O(V)` time for all the nodes.

• Stack Operations: Nodes are pushed to and popped from a stack, which takes `O(1)` time per operation. Each node is pushed and popped from the stack exactly once during the algorithm, contributing to the overall `O(V)` time.

• Checking and Updating Low-Link Values: The algorithm updates low-link values for nodes based on their neighbors. Each such update is an `O(1)` operation, and these updates are performed as part of the DFS, contributing to the linear time complexity.

The overall time complexity comes out to be `O(V + E)`.

Space Complexity

The space complexity of Tarjan’s algorithm is also `O(V)`, where `V` is the number of vertices (nodes) in the graph. Below is the breakup of how we arrived at this space complexity:

• Recursion Stack: The depth-first search is recursive, and in the worst case, the recursion stack could grow to contain every node in the graph, leading to `O(V)` space usage.

• Index and Low-Link Arrays: These are arrays (or dictionaries) that store an integer value for each node, resulting in `O(V)` space.
• Stack: Used to keep track of nodes in the current search path, with a maximum size of `V` if all nodes are on the stack at the same time.
• On-Stack Set: Keeps track of which nodes are currently in the stack, with space complexity `O(V)` since it can contain every node.
• SCC List: Stores the strongly connected components found in the graph. In the worst case, this can also take `O(V)` space, assuming every node is in a separate SCC.

The overall space complexity comes out to be `O(V)`, which means that space required is directly proportional to the number of nodes in the graph.

Intuition Behind the Algorithm

In order to understand how Tarjan’s algorithm works and the intuition behind it, let’s consider the graph below which has 4 Strongly Connected Components:

You can think of the above graph as a component tree where there are specific entry points for each Strongly connected component. You can enter `SCC2` via `SCC1` or `SCC4`. Similarly you can enter `SCC3` via `SCC2`.

Now if you perform DFS from any node present in `SCC1`, the nodes present in `SCC1` may be fully or partially explored before moving to `SCC2` and then the nodes in `SCC3` are explored.

Our goal is to separate out strongly connected components, so what is some pattern in DFS which we can use to identify SCCs?

The pattern is that all nodes present in ending/terminal SCC are visited before returning back to older/inner SCCs. In the above example, DFS returns back to `SCC2` only after all the nodes in `SCC3` are visited. Also, visiting of nodes in `SCC3` start from node `I` and the exit of DFS from `SCC3` also happens via the same node. So in terms of visit times, all nodes in `SCC3` will have visit times more than the visit time of node `I`.

Tarjan’s algorithm uses this principle to identify the source node from which an SCC starts (node `I` for `SCC3`) by tracking the discovery time/index of each node along with low link value. Whenever a node is encountered which has the same discovery time and low link value, it means the current node is a source node for the current component. So all nodes present on the stack until the source/current node belong to one component.

If you are still not clear on why the algorithm works, you can check out this notes from Caltech and this from CMU on strongly connected components.