The Disjoint-Set data structure, also known as Union-Find, manages elements in non-overlapping subsets. It's primarily used for graph connectivity problems, offering two operations: Union, to merge sets, and Find, to identify a set's leader. This structure excels in tasks involving elements initially represented as separate sets, enabling efficient set combination and connectivity checks between elements. It's defined by disjoint sets (non-intersecting) and the Union-Find algorithm, which determines subset membership and merges subsets. The Disjoint-Set structure is crucial for problems requiring frequent set merges and connectivity queries, due to its near-constant-time operations.
Whenever we have been given a problem that contains n elements represented as separate sets initially and we need to the perform following operations:
Disjoint Set: Two sets are called disjoint set if the intersection of two sets is ϕ i.e. NULL. A Disjoint-Set data structure that keeps records of a set of elements partitioned into several non-overlapping (Disjoint) subsets.
Union Find: Union-Find algorithm performs two very useful operations -
Find: To find the subset a particular element 'k' belongs to. It is generally used to check if two elements belong to the same subset or not.
Union: It is used to combine two subsets into one. A union query, say Union(x, y) combines the set containing element x and the set containing element y."
In this example, we are going to see how the union queries (series of Union operations) work in a Disjoint Set .
Let's assume we have 7 nodes initially, we will store them in the form of trees.Where each tree corresponds to one set and the root of the tree will be the parent/leader of the set.
At first, you can see that all the seven nodes are parents of themselves. In simple words we can say, we have seven different trees containing 1 element each i.e. the root of the tree.
Now we will perform the following union queries one by one on them --
The most basic implementation of the Disjoint set one can think of is to keep a track of the parent of every element present in a list/array.
We will have a global array/list of size n where 'n' is the number of elements in the set.
Finding the parent of an element 'u' is very easy, since we are keeping track of parent of every element. If parent[u]=u then we will return parent[u] in the other case we will recursively find the parent of parent[u] until the condition parent[u]=u satisfies.
For merging two sets, Let we need to perform Union( u, v ) then
And that's it, yes you read it right this is the pseudocode for implementing Disjoint Set data structure, but it is not efficient for large inputs.
This approach is not much efficient, so to get the full-fledged advantage of Disjoint Set Union we will look for some minor modifications that can make our code efficient.
Let u be a node and p be its parent. When we are finding parent of u recursively, we are also finding parent of all those intermediate nodes which we visit in the path u → p .
The trick here is to shorten the path to the parent of all those intermediate nodes by directly connecting them to the leader of the set.
You can see that in the process of finding the parent of node 8 we have attached all the intermediate nodes to 1 .
To achieve this, we just need to modify our find function a bit and that minor modification is - During the recursive call of finding parent of parent[u] we will also pass the result to parent[u] (please refer to the last line of underlying pseudocode for better understanding)
This is firstly finding the root of the set and in the process of stack unwinding, we are attaching all the intermediate nodes to their representative. Yes by modifying that line we reduced the time complexity from O ( n ) O(n) O ( n ) to O ( l o g ( n ) ) O(log(n)) O ( l o g ( n ) )
In this optimization, we would be very much careful about which tree is getting attached to the other one. In the naive implementation, the second tree is always getting attached to the first one which could lead the structure to become like the skewed tree which would have linear chains of length O(n) .
So what we will do is, we will maintain a record of the depth of each tree (Rank) and during the union operation we will attach the tree with lower rank to the tree with higher the rank thus Rank of the final tree (after union operation) will remain the same.
In case, if the Rank of both trees is equal we will join any one of them with the other and will increase the rank by 1 of the tree to which another one is joined.
Suppose we need to perform on give two trees - Tree 1 -
Tree 2 -
If we combine them by checking their rank then we would find rank [tree1]=2 and rank[tree2]=1 then, the combined tree would be - We can see that since tree 2 is attached to tree 1 so, the rank of the combined tree is same as tree 1.
If combine Tree 1 and Tree 2 without checking their rank then the combined tree would look like -
Now we can observe attaching tree 1 to tree 2 is not a good idea as the depth of the combined tree has been increased by 1 which would cost us more time to find the leader of the tree.
The change in algorithm of union function is as follows:
By doing this modification alone (i.e. without Path compression) we can reduce the time complexity to O ( l o g ( n ) ) O(log(n)) O ( l o g ( n ) ) .
In optimization 1, we have seen how can we compress the path by directly attaching all the intermediate nodes to the leader of the set during the process of finding the parent of node u the process is termed as "Path Compression".
In optimization 2, we have seen that how can we minimize the depth of trees by attaching lower rank trees to trees with higher ranks while merging them. The process is termed as "Union by Rank"
So merging optimization 1 and 2 seems to be a good idea, isn't it? The most optimal approach is the same only i.e. merging Optimization 1 and 2 . If we implement the Find function using Path Compression and the Union function using Union by rank we can achieve the most optimized algorithm.
For implementing we just need two global arrays i.e. parent and rank of size n each to maintain parent and rank of every node present. For Find function we will use the pseudocode used in Optimization 1 and for Union function we will use pseudocode used in Optimization 2.
To make our code more modular let's make a class to implement D S U DSU D S U with following data members and methods -