Disjoint Sets in C++

ยท

6 min read

This Data-Structure deals mainly with Sets and their operations. Why is that important? Because we can accomplish operations in near O(1) time.

This DS mainly covers these operations:

  1. Union

  2. Find

Due to these operations, this DS is also called as Union-Find DS.

Union Find

The above picture is a representation of 7 sets. The sets rounded are Union to each other. For example, set-4 and set-5 are Disjoint Sets whereas set-1 and set-8 are 'Union'.

There are usually two implementations of this DS.

  1. Naive.

  2. Compressed.

Naive Approach

This takes O(n) time. We'll have a parent array where each index represents a set and the value of that index represents the set it is joined to. When creating a set, the parent(value at index) of that set is itself. It is actually a tree represented in the form of an array. For example, the array(1-indexed) for the above picture looks like:

{ 1, 1, 3, 3, 1, 1, 7, 1 }.

Naive Implementation

void make_set(int vertex) {
    parent[vertex] = vertex; // As said, the parent of that set is itself while creating a set.
}


// O(n) time.
int find_set(int vertex) {
    if (vertex == parent[vertex]) return vertex; // If current vertex is its parent, we return vertex.

    return find_set(parent[vertex]); // Else, we go up to it's parent one by one, just like climbing a tree.
}

void union_sets(int set_a, int set_b) {
    set_a = find_set(set_a); // Find which set a belongs to.
    set_b = find_set(set_b); // Find which set b belongs to.
    if (set_a != set_b)      // Is they're equal, they belong to the same set already. Else, set parent of any one of these 2 sets to other.
        parent[set_b] = set_a;
}

So, why is this naive? Because, the distance between the parent and children is not optimized. In the worst case, it may take even up to N calls to find its parent.

So, we compress the paths between parent and child to reduce the distance between them so that no. of steps taken is as minimal as possible.

Compressed Approach

This approach takes O(logN) for find operation. The only change we'll be doing is that we'll first find the root, then while returning back we start to attach them directly to the representative.

int find_set(int vertex) {
    if (vertex == parent[vertex]) return vertex;

    return parent[vertex] = find_set(parent[vertex]);
}

In the Union operation of Naive implementation, we can see that we attach the second set to the first set. This may sometimes lead to a long chain in the same direction. To resolve this, we can consider the depth or rank of a vertex and if one is greater than another, we swap them. This makes sure that the tree is not one-sided.

This code does that.

// Union By Size.
void union_sets_by_sz(int set_a, int set_b) {
    set_a = find_set(set_a);
    set_b = find_set(set_b);
    if (set_a != set_b) {
        if (sz[set_a] < sz[set_b]) swap(set_a, set_b); // If size is less than another, swap them.

        parent[set_b] = set_a;
        sz[set_a] += sz[set_b];
    }
}


// Union By Rank.
void union_sets_by_rank(int set_a, int set_b) {
    set_a = find_set(set_a);
    set_b = find_set(set_b);
    if (set_a != set_b) {
        if (rnk[set_a] < rnk[set_b]) swap(set_a, set_b); // If rank (depth) is less than another, swap them.

        parent[set_b] = set_a;
        if (rnk[set_a] == rnk[set_b]) ++rnk[set_a];
    }
}

Both of these operations cost the same time and space.

But, how can we achieve near O(1) time as said? We can combine both Compressed Union and Compressed Find to get there. For initial queries, the time will be O(logN), after that it will reach O(1). This will be really efficient when we have a large number of queries.

Since it is so fast, Compressed Union and Find are also called as Fast Union-Find.

The Naive and Compressed trees differ so:

DS Compressed vs Naive
Complete Code