Distributed Election

Distributed Election Algorithm#

Content from

Geek Time Column: “Principles and Algorithm Analysis of Distributed Technology”

Fundamentals of Distributed Systems - Election, Majority, and Lease

Election is a common problem in distributed system practice. By breaking the peer relationship between nodes, the elected leader (also called master or coordinator) helps achieve transaction atomicity and improve decision-making efficiency. The concept of quorum helps us achieve decision consistency in the case of network partitioning and assists in selecting a unique leader in leader election scenarios. The central idea of lease is that only one node can obtain the lease and become the leader during each lease duration, and the lease must be reissued upon expiration, ensuring that there is at most one leader at any given time, thus avoiding the dual-master problem that can arise from network congestion or momentary disconnections when only using heartbeat mechanisms.

Bully Algorithm (Election)#

The Bully algorithm is the most common election algorithm, requiring each node to have a corresponding number, with the node having the highest number being the leader. If the leader crashes, the node with the second highest number is re-elected as the leader.

The Bully algorithm uses the following 3 types of messages during the election process:

Election message, used to initiate an election;
Alive message, a response to the Election message;
Victory message, a declaration of sovereignty sent by the successfully elected master node to other nodes.

Election Process#

Each node in the cluster determines whether its ID is the largest among the currently alive nodes. If it is, it directly sends a Victory message to other nodes, declaring its sovereignty;
If it is not the largest, it sends an Election message to all nodes with a larger ID and waits for responses from other nodes;
If this node does not receive Alive messages from other nodes within a given time frame, it considers itself the master node and sends a Victory message to other nodes, declaring itself the master;
If it receives an Alive message from a node with a larger ID, it waits for other nodes to send Victory messages; if this node receives an Election message from a node with a smaller ID, it replies with an Alive message, informing other nodes that it has a larger ID and that a re-election is needed.

Advantages and Disadvantages, and Application Scenarios#

Advantages:
The Bully algorithm's selection is particularly domineering and simple; whoever is alive and has the largest ID becomes the master node, and other nodes must unconditionally obey. The advantages of this algorithm are its fast election speed, low algorithm complexity, and ease of implementation.

Disadvantages:

It requires each node to have global node information, resulting in a relatively large amount of additional information storage;
Additionally, any new node with an ID larger than the current master node or a node that recovers after a failure may trigger a re-election and become the new master node. If this node frequently leaves and joins the cluster, it can lead to frequent master changes.

Application Scenario:

MongoDB's replica set failover feature

In MongoDB's distributed election, the last operation timestamp of the node is used to represent the ID, meaning the node with the most recent timestamp has the largest ID, thus the most recent and alive node is the master node.

Raft Algorithm (Majority)#

The Raft algorithm is a typical majority voting election algorithm, with the core idea being "the minority follows the majority," where the node receiving the most votes becomes the master.

In the Raft algorithm election, there are 3 roles for cluster nodes:

Leader, the master node, with only one Leader at any given time, responsible for coordinating and managing other nodes;
Candidate, a candidate node, where any node can become a Candidate, and only nodes in this role can be elected as the new Leader;
Follower, the followers of the Leader, which cannot initiate elections.

Election Process#

At initialization, all nodes are in the Follower state.
When starting the election, all nodes transition from Follower to Candidate and send election requests to other nodes.
Other nodes respond to whether they agree to become the master based on the order of received election requests. It is important to note that in each round of elections, a node can only cast one vote.
If the node initiating the election request receives more than half of the votes, it becomes the master node, transitioning its state to Leader, while the states of other nodes change from Candidate to Follower. The Leader node and Follower nodes will periodically send heartbeat messages to check if the master node is alive.
When the Leader node's term ends, i.e., when it detects that other servers are starting the next round of master election, the Leader node's state transitions from Leader to Follower, entering a new round of elections.

Advantages and Disadvantages, and Application Scenarios#

Advantages:

Fast election speed, low algorithm complexity, and ease of implementation.
Stability is better than the Bully algorithm because when new nodes join or nodes recover from failure, it may trigger a master election, but it does not necessarily lead to a master change unless the new or recovered node receives more than half of the votes.

Disadvantages:
It requires that every node in the system can communicate with each other and that more than half of the votes must be obtained for a successful master election, resulting in a large communication volume.

Application Scenario:

Google’s open-source Kubernetes, which uses the etcd component that implements master election and consistency using the Raft algorithm.

ZAB Algorithm (Majority)#

The core of the ZAB election algorithm is "the minority follows the majority, with nodes having larger IDs prioritized to become the master." Therefore, during the election process, (vote_id, vote_zxID) is used to indicate which node is being voted for, where vote_id represents the ID of the voted node, and vote_zxID represents the server zxID of the voted node. The principle for selecting the master in the ZAB algorithm is that the server_zxID with the largest value becomes the Leader; if server_zxID values are the same, the server_id with the largest value becomes the Leader.

When using the ZAB algorithm for elections, each node in the cluster has 3 roles:

Leader, the master node;
Follower, follower nodes;
Observer, observers without voting rights.

In the election process, the nodes in the cluster have 4 states:

Looking state, the election state. When a node is in this state, it believes there is currently no Leader in the cluster, thus entering the election state.
Leading state, the leader state, indicating that a master has been elected and the current node is the Leader.
Following state, the follower state, where after a master has been elected, the states of other non-master nodes update to Following, indicating their following of the Leader.
Observing state, the observer state, indicating that the current node is an Observer, taking a wait-and-see attitude without voting or election rights.

Election Process#

When the system starts, all 3 servers are currently voting in the first round, i.e., epoch=1, and zxID are all 0. At this time, each server nominates itself and broadcasts the voting information.
Based on the judgment rules, since the epochs and zxIDs of the 3 servers are the same, the server_ids are compared, and the larger one is the nominated object. Therefore, Server 1 and Server 2 change their vote_id to 3, update their voting boxes, and re-broadcast their votes.
At this point, all servers in the system have nominated Server 3, so Server 3 is elected as the Leader, in the Leading state, sending heartbeat messages to other servers and maintaining connections; Server 1 and Server 2 are in the Following state.

Advantages and Disadvantages, and Application Scenarios#

Advantages:

High algorithm performance, with no special requirements for the system.
The stability of the election algorithm is relatively good; when new nodes join or nodes recover from failure, it may trigger a master election, but it does not necessarily lead to a master change.

Disadvantages:

Using a broadcasting method to send information, if there are n nodes, each node broadcasts simultaneously, resulting in n*(n-1) messages in the cluster, which can easily lead to a broadcast storm (similar to the "signaling storm" in distributed mutual exclusion).
In addition to voting, it also adds comparisons of node IDs and data IDs, meaning that all node IDs and data IDs must be known, resulting in relatively longer election times.

Application Scenario:

Designed for implementing distributed coordination functions for ZooKeeper.

Summary Analysis#

Comparison of Three Algorithms#

Mind Map Summary#

Thought Expansion#

1. Why are "majority" master election algorithms usually implemented with an odd number of nodes rather than an even number?#

There may be cases where two nodes receive half the votes each, and with an even number of nodes, it is impossible to elect a master, necessitating a re-vote. However, even with a re-vote, the probability of the two nodes having the same number of votes remains high.

2. What is the relationship between distributed election and consistency?#

Consistency refers to whether multiple replicas can have the same value at the same time. Distributed elections are often the basis for achieving consistency by electing a master node to coordinate and manage other nodes, ensuring the orderly operation of other nodes and consistency between them.

3. Is there a scenario where a cluster has dual masters?#

The dual-master situation generally arises from network failures, such as network partitioning (where the cluster forms two separate networks due to network disconnection). During the period of dual masters, if both masters provide services, it may lead to data inconsistency within the cluster. Therefore, it is necessary to decide whether to allow service provision under dual-master conditions based on the business's tolerance for data inconsistency.

Associative Memory: The phenomenon of brain splitting in clusters, reference: