Monday 3 October 2016

Election algorithms for clustered software

The problem

I’ve recently been looking at a problem with some software that was written to work in a cluster. This particular service runs background jobs against a SQL Server database and was designed as a cluster to support fail-over scenarios. Only one service instance (the master) was actually doing any work at any given time, with the other instances (the slaves) providing redundancy in the case of a failure. In other words, one instance would be nominated as the master and would take responsibility for running the background jobs. If the master crashed or became unavailable, one of the other instances in the cluster would take over as master.

From now on I’m going to continue to use the term service instance to describe a software component that participates in a cluster. Each service instance is probably a separate process.

The problem was that the mechanism used to elect and monitor the master was based on UDP broadcast, and broadcast is something that can be problematic in cloud-based environments such as AWS. Given there was a need to migrate this service to the cloud this was a significant issue.

At a high level, the election algorithm used by the cluster had service instances exchange messages between themselves via UDP broadcast to agree which instance would be the master. Once the master had been nominated it took over the work of running the background jobs. The other service instances would then periodically poll the master to check that it was still alive. The first instance to find the master unavailable would claim the master role, take over the responsibility of running the jobs and broadcast the change in master.
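For illustration only, the sketch below shows roughly what this style of messaging looks like in C#: an instance announces a message to the local subnet via UDP broadcast, and any instance listening on the same port receives it. The port number and message format are assumptions made for the example, not details of the service’s actual protocol.

using System.Net;
using System.Net.Sockets;
using System.Text;

// Minimal sketch of UDP broadcast messaging between service instances.
// The port number and message format are illustrative assumptions.
public static class ClusterBroadcast
{
    private const int Port = 9876; // hypothetical cluster port

    // Broadcast a message (e.g. announcing a change of master) to the local subnet.
    public static void Announce(string message)
    {
        using (var client = new UdpClient())
        {
            client.EnableBroadcast = true;
            var bytes = Encoding.UTF8.GetBytes(message);
            client.Send(bytes, bytes.Length, new IPEndPoint(IPAddress.Broadcast, Port));
        }
    }

    // Block until a broadcast message is received from any instance in the cluster.
    public static string Receive()
    {
        using (var listener = new UdpClient(Port))
        {
            var remote = new IPEndPoint(IPAddress.Any, 0);
            var bytes = listener.Receive(ref remote);
            return Encoding.UTF8.GetString(bytes);
        }
    }
}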

The use of UDP broadcast in this context was useful because it meant that service instances didn’t need to know about each other. To use more direct addressing it would be necessary to store the addresses of all instances in the cluster in some form of registry or configuration. Configuration management across multiple environments is itself a challenge so reducing the amount of configuration can be an advantage.

However, in this case the use of UDP broadcast was an issue that needed to be addressed to facilitate a move to the cloud. This provided a good opportunity to review clustering election patterns and approaches to writing clustered software in general to see what options are available.

Note: There are alternatives to writing software that behaves as a cluster natively (e.g. building on ZooKeeper). This article does not deal with these alternative approaches but focuses on the creation of natively clustered software.

Reasons for clustering

There are typically 2 reasons for writing software that supports clustering:

  • Failover – to prevent outages it would be advantageous to build in redundancy so that if one service instance crashes there’s another available to take up the slack. Note that in this case it isn’t necessary for all instances to be doing useful work. Some may be on stand-by, available to take over if the primary fails but not doing anything while the primary is active.
  • Performance – to achieve greater application performance it may be advantageous to run separate software instances (probably on separate servers). In this case work can be distributed between instances and processed in parallel.

Of course, these two aspects are not mutually exclusive; a cluster may support both high availability and distributed processing.

Characteristics of clustered software

Typically when running software as a cluster one instance will be nominated as the coordinator (leader or master). Note that this instance does not have to perform the work itself; it may choose to delegate the work to one of the other instances in the cluster. Alternatively, as in our example above, the coordinator may perform the work itself exclusively.

This is somewhat analogous to server clustering which can be either symmetrical or asymmetrical. In the symmetrical case every server in the cluster is performing useful work. To distribute work between the servers in the cluster a load balancer is required. In the case of a software cluster it’s the instance elected as the coordinator that’s probably performing this task.

In the asymmetrical case only one server will be active with the other server instances in the cluster being passive. A passive instance will only be activated in the event of a failure of the primary. In the case of a software cluster the coordinator would be the active instance with other instances being passive.

Whichever basic topology is chosen it will be necessary for the software cluster to elect a coordinator when the cluster starts. It will also be necessary for the cluster to recognise when a coordinator has crashed or become unavailable and for this to trigger the election of a new coordinator.

When designing a system like this care should be taken to avoid the coordinator becoming a bottleneck itself. There are also other considerations. For example, in auto-scaling scenarios what happens if the coordinator is shut down as a result of downsizing the infrastructure?

Election patterns

How do software clusters go about managing the election of a coordinator? Below is a discussion of 3 possible approaches:

  • Distributed mutex – a shared mutex is made available to all service instances and is used to manage which instance is the coordinator. Essentially, all service instances race to grab the mutex. The first to succeed becomes the coordinator.
  • Bully algorithm – use messaging between instances in the cluster to elect the coordinator. The election is based on some unique property of each instance (e.g. a process identifier). The instance with the highest value ‘wins’. The winning instance bullies the other instances into submission by claiming the coordinator role.
  • Ring algorithm – use messaging between instances in the cluster to elect the coordinator. Service instances are ordered (either physically or logically) so each instance knows its successor. Ordering in the ring is significant, with election messages being passed around the ring to figure out which instance is ‘at the top’. That instance is elected the coordinator.

More detailed descriptions of the approaches are provided below. As you’d expect each has its pros and cons.

Distributed mutex

A mutex “ensures that multiple processes that share resources do not attempt to share the same resource at the same time”. In this case the ‘resource’ is really a role (that of coordinator) that one service instance adopts.

Using a distributed mutex has the advantage that it works in situations where there is no natural leader (e.g. no suitable process identifier, which would be required for the Bully Algorithm). Under some circumstances (e.g. when the coordinator is the only instance performing any work) the service instances need not know about each other either; the shared mutex is the only thing an instance needs to know about. In cases where the coordinator needs to distribute work amongst the other instances in the cluster, the coordinator must be able to contact, and therefore know about, the other instances.

The algorithm essentially follows this process (a code sketch follows the steps):

  1. Service instances race to get a lease over a distributed mutex (e.g. a database object).
  2. The first instance to get the mutex is elected as the coordinator. Other instances are prevented from becoming the coordinator because they are blocked from getting a lease on the mutex.
  3. The coordinator performs the task of coordinating the distribution of work (or executing it itself depending on requirements).
  4. The lease must be set to expire after a period of time and the coordinator must periodically renew the lease. If the coordinator crashes or becomes unavailable it won’t be able to renew the lease on the mutex, which will eventually expire and become available again.
  5. All service instances periodically check the mutex to see if the lease has expired. If a service instance finds the lease on the mutex to be available it attempts to secure the lease. If it succeeds the instance becomes the new coordinator.
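As an illustration of the steps above, here is a minimal sketch of a lease-based mutex held in a SQL Server table (a natural fit given the service already uses SQL Server). The table name, column names, lease duration and connection handling are assumptions made for the example, not details of the actual service.

using System;
using System.Data.SqlClient;

// Sketch of a lease-based distributed mutex held in a SQL Server table.
// Assumes a table such as:
//   CREATE TABLE CoordinatorLease
//     (Name nvarchar(50) PRIMARY KEY, Owner nvarchar(50) NULL, ExpiresAtUtc datetime2 NULL);
// with a single pre-inserted row ('coordinator', NULL, NULL). All names are illustrative.
public class SqlLeaseMutex
{
    private static readonly TimeSpan LeaseDuration = TimeSpan.FromSeconds(30);
    private readonly string _connectionString;
    private readonly string _instanceId;

    public SqlLeaseMutex(string connectionString, string instanceId)
    {
        _connectionString = connectionString;
        _instanceId = instanceId;
    }

    // Try to acquire (or renew) the lease. Returns true if this instance is currently the coordinator.
    public bool TryAcquire()
    {
        const string sql = @"
            UPDATE CoordinatorLease
            SET Owner = @owner, ExpiresAtUtc = @expires
            WHERE Name = 'coordinator'
              AND (Owner IS NULL OR Owner = @owner OR ExpiresAtUtc < SYSUTCDATETIME());";

        using (var connection = new SqlConnection(_connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            command.Parameters.AddWithValue("@owner", _instanceId);
            command.Parameters.AddWithValue("@expires", DateTime.UtcNow.Add(LeaseDuration));
            connection.Open();
            // Exactly one row updated means we now hold (or have just renewed) the lease.
            return command.ExecuteNonQuery() == 1;
        }
    }
}

Each instance would call TryAcquire on a timer with an interval comfortably shorter than the lease duration: the current coordinator keeps renewing its lease, while the other instances keep failing to acquire it until the coordinator stops renewing and the lease expires.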

Note that the mutex becomes a potential single point of failure, so consideration should be given to the scenario where unavailability of the mutex prevents the cluster from electing a coordinator.

Another characteristic of using a shared mutex in this way is that election of the leader is non-deterministic. Any service instance in the cluster could take on the role of coordinator.

A good explanation of the shared mutex approach can be found in this article from MSDN.

Bully algorithm

There are some assumptions for the Bully Algorithm:

  • Each instance in the cluster has a unique identifier which must be an ordinal. This could be a process number or even a network address but whatever it is we should be able to order instances in the cluster using this identifier.
  • Each instance knows the identifiers of the other instances that should be participating in the cluster (some may be dead for whatever reason).
  • Service instances don’t know which ones are available and which are not.
  • Service instances must be able to send messages to each other.

The basis of the Bully Algorithm is that the service instance with the highest identifier will be the coordinator. The algorithm provides a mechanism for service instances to discover which of them has the highest identifier and for that instance to bully the others into submission by claiming the coordinator role. It follows this basic process (a code sketch follows the steps):

  1. A service instance sends an ELECTION message to all instances with identifiers greater than its own and awaits responses.
  2. If no service instances respond the originator can conclude it has the highest identifier and is therefore safe to assume the role of coordinator. The instance sends a COORDINATOR message to all other instances announcing the fact. Other instances will then start to periodically check that the coordinator is still available. If it isn’t, the instance that finds the coordinator unavailable will start a new election (back to step 1).
  3. Any service instance receiving an ELECTION message and having an identifier greater than the originator’s will respond with an OK message indicating it’s available.
  4. If in response to an ELECTION message the originator receives an OK response back it knows there’s at least one service instance with a higher identifier than itself. The following then happens:
    1. The original service instance abandons the election (because it knows there’s at least one process with a higher identifier than itself).
    2. Any instances that responded to the ELECTION message with OK now issue ELECTION messages themselves (they start at step 1) and the process repeats until the service with the highest identifier has been elected.
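A minimal sketch of the election step for a single instance is shown below. The ICluster interface is a hypothetical abstraction over whatever transport delivers the ELECTION, OK and COORDINATOR messages; the algorithm itself says nothing about the messaging technology, so everything here other than the decision logic is an assumption.

using System.Collections.Generic;
using System.Linq;

// Hypothetical messaging abstraction. SendElection sends an ELECTION message to
// the target instance and returns true if an OK response arrives within a timeout.
public interface ICluster
{
    IEnumerable<int> KnownInstanceIds { get; }
    bool SendElection(int targetId);
    void AnnounceCoordinator(int coordinatorId); // COORDINATOR message to all instances
}

// Sketch of the Bully Algorithm election step for a single service instance.
public class BullyElector
{
    private readonly ICluster _cluster;
    private readonly int _myId;

    public BullyElector(ICluster cluster, int myId)
    {
        _cluster = cluster;
        _myId = myId;
    }

    // Start an election. Returns true if this instance became the coordinator.
    public bool StartElection()
    {
        bool anyHigherResponded = false;

        // Send ELECTION to every instance with a higher identifier than our own
        // (in practice these would be sent concurrently and awaited with a timeout).
        foreach (var id in _cluster.KnownInstanceIds.Where(id => id > _myId))
        {
            if (_cluster.SendElection(id))
            {
                anyHigherResponded = true;
            }
        }

        if (anyHigherResponded)
        {
            // At least one instance with a higher identifier is alive; it will run
            // its own election, so this instance abandons the election.
            return false;
        }

        // Nobody higher responded: this instance has the highest live identifier.
        _cluster.AnnounceCoordinator(_myId);
        return true;
    }
}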

A nice description of the process can be found in this article.

Ring algorithm

As with the Bully Algorithm, there are some basic assumptions for the Ring Algorithm.

  • The service instances are ordered in some way.
  • Each service instance uses the ordering to know who its successor is (in fact it needs to know about all the instances in the ring, as we will see below).

The Ring Algorithm basically works like this (a code sketch follows the steps):

  1. All service instances monitor the coordinator.
  2. If any service instance finds the coordinator is not available it sends an ELECTION message to its successor. If the successor is not available the message is sent to the next instance in the ring until an active one is found.
  3. Each service instance that receives the ELECTION message adds its identifier to the message and passes it on as in step 2.
  4. Eventually the message gets back to the originating service instance, which recognises the fact because its own identifier is in the list. It examines the list of active instances, finds the one with the highest identifier, and issues a COORDINATOR message informing all the instances in the ring which one is now the coordinator.
  5. The service instance with the highest identifier has now been elected as the coordinator and processing resumes.
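The sketch below illustrates how a single instance handles an incoming ELECTION message: either it adds its own identifier and forwards the message, or, once the message has been all the way round the ring, it picks the highest identifier as the coordinator. The types and method names are illustrative and the transport between ring members is omitted.

using System.Collections.Generic;
using System.Linq;

// The ELECTION message accumulates the identifiers of every live instance it visits.
public class RingElectionMessage
{
    public List<int> VisitedIds { get; } = new List<int>();
}

// Sketch of how one service instance in the ring handles an ELECTION message.
public class RingElector
{
    private readonly int _myId;

    public RingElector(int myId)
    {
        _myId = myId;
    }

    // Called when an ELECTION message arrives from the predecessor in the ring.
    // Returns the message to forward to the next live successor, or null when the
    // election has completed, in which case electedCoordinatorId is set and a
    // COORDINATOR message should be sent around the ring instead.
    public RingElectionMessage OnElectionMessage(RingElectionMessage message,
                                                 out int? electedCoordinatorId)
    {
        if (message.VisitedIds.Contains(_myId))
        {
            // The message has been all the way round: every live instance is listed,
            // so the one with the highest identifier becomes the coordinator.
            electedCoordinatorId = message.VisitedIds.Max();
            return null;
        }

        // First time this instance has seen the election: add our identifier and pass it on.
        message.VisitedIds.Add(_myId);
        electedCoordinatorId = null;
        return message;
    }
}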

Note that multiple instances could recognise that the coordinator is unavailable, resulting in multiple ELECTION and COORDINATOR messages being sent around the ring. This doesn’t matter; the result is the same.

Other things to look at

A NuGet package is available for a lightweight, non-intrusive leader election library for .Net called NanoCluster. Source code is available on GitHub here:

https://github.com/ruslander/NanoCluster

It’s a small project and doesn’t seem to have been used a great deal but might provide some ideas.

References