Monday, 3 October 2016

Election algorithms for clustered software

The problem

I’ve recently been looking at a problem with some software that was written to work in a cluster. This particular service runs background jobs against a SQL Server database and, to support fail-over scenarios, it was designed to run as a cluster: only one service instance (the master) actually did any work at any given time, with the other instances (the slaves) providing redundancy in the case of a failure. In other words, one instance would be nominated as the master and would take responsibility for running the background jobs. If the master crashed or became unavailable one of the other instances in the cluster would take over as master.

From now on I’ll use the term service instance to describe a software component that participates in a cluster. Each service instance is probably a separate process.

The problem was that the mechanism used to elect and monitor the master was based on UDP broadcast, and broadcast is something that can be problematic in cloud-based environments such as AWS. Given there was a need to migrate this service to the cloud this was a significant issue.

At a high level, the election algorithm used by the cluster had service instances exchange messages over UDP broadcast to agree which instance would be the master. Once the master had been nominated it took over the work of running the background jobs. The other service instances would then periodically poll the master to check that it was still alive. The first instance to find the master unavailable would claim the master role, take over responsibility for running the jobs and broadcast the change of master.

The use of UDP broadcast in this context was useful because it meant that service instances didn’t need to know about each other. To use more direct addressing it would be necessary to store the addresses of all instances in the cluster in some form of registry or configuration. Configuration management across multiple environments is itself a challenge so reducing the amount of configuration can be an advantage.

However, in this case the use of UDP broadcast was an issue that needed to be addressed to facilitate a move to the cloud. This provided a good opportunity to review clustering election patterns and approaches to writing clustered software in general to see what options are available.

Note: There are alternatives to writing software that behaves as a cluster natively (e.g. delegating coordination to something like ZooKeeper). This article does not deal with these alternative approaches but focuses on the creation of natively clustered software.

Reasons for clustering

There are typically 2 reasons for writing software that supports clustering:

  • Failover – to prevent outages it would be advantageous to build in redundancy so that if one service instance crashes there’s another available to take up the slack. Note that in this case it isn’t necessary for all instances to be doing useful work. Some may be on stand-by, available to take over if the primary fails but not doing anything while the primary is active.
  • Performance – to facilitate greater application performance running separate software instances (probably on separate servers) may be advantageous. In this case work can be distributed between instances and processed in parallel.


Of course, these two aspects are not mutually exclusive; a cluster may support both high availability and distributed processing.

Characteristics of clustered software

Typically when running software as a cluster one instance will be nominated as the coordinator (leader or master). Note that this instance does not have to perform the work itself; it may choose to delegate the work to one of the other instances in the cluster. Alternatively – such as in our example above – the coordinator may perform the work itself exclusively.

This is somewhat analogous to server clustering which can be either symmetrical or asymmetrical. In the symmetrical case every server in the cluster is performing useful work. To distribute work between the servers in the cluster a load balancer is required. In the case of a software cluster it’s the instance elected as the coordinator that’s probably performing this task.

In the asymmetrical case only one server will be active with the other server instances in the cluster being passive. A passive instance will only be activated in the event of a failure of the primary. In the case of a software cluster the coordinator would be the active instance with other instances being passive.

Whichever basic topology is chosen it will be necessary for the software cluster to elect a coordinator when the cluster starts. It will also be necessary for the cluster to recognise when a coordinator has crashed or become unavailable and for this to trigger the election of a new coordinator.

When designing a system like this care should be taken to avoid the coordinator becoming a bottleneck itself. There are also other considerations. For example, in auto-scaling scenarios what happens if the coordinator is shut down as a result of downsizing the infrastructure?

Election patterns

How do software clusters go about managing the election of a coordinator? Below is a discussion of 3 possible approaches:

  • Distributed mutex – a shared mutex is made available to all service instances and is used to manage which instance is the coordinator. Essentially, all service instances race to grab the mutex. The first to succeed becomes the coordinator.
  • Bully algorithm – use messaging between instances in the cluster to elect the coordinator. The election is based on some unique property of each instance (e.g. a process identifier). The instance with the highest value ‘wins’ and bullies the other instances into submission by claiming the coordinator role.
  • Ring algorithm – use messaging between instances in the cluster to elect the coordinator. Service instances are ordered (either physically or logically) so each instance knows its successor. Ordering in the ring is significant, with election messages being passed around the ring to figure out which instance is ‘at the top’. That instance is elected the coordinator.


More detailed descriptions of the approaches are provided below. As you’d expect each has its pros and cons.

Distributed mutex

A mutex “ensures that multiple processes that share resources do not attempt to share the same resource at the same time”. In this case the ‘resource’ is really a role – that of coordinator – that one service instance adopts.

Using a distributed mutex has the advantage that it works in situations where there is no natural leader (e.g. no suitable process identifier which would be required for the Bully Algorithm). Under some circumstances (e.g. when the coordinator is the only instance performing any work) the service instances need not know about each other either; the shared mutex is the only thing an instance needs to know about. In cases where the coordinator needs to distribute work amongst the other instances in the cluster then the coordinator must be able to contact – and therefore know about – the other instances.

The algorithm essentially follows this process:

  1. Service instances race to get a lease over a distributed mutex (e.g. a database object).
  2. The first instance to get the mutex is elected as the coordinator. Other instances are prevented from becoming the coordinator because they are blocked from getting a lease on the mutex.
  3. The coordinator performs the task of coordinating the distribution of work (or executing it itself, depending on requirements).
  4. The lease must be set to expire after a period of time and the coordinator must periodically renew it. If the coordinator crashes or becomes unavailable it won’t be able to renew the lease, which will eventually expire and the mutex will become available again.
  5. All service instances periodically check the mutex to see if the lease has expired. If a service instance finds the lease on the mutex to be available it attempts to secure the lease. If it succeeds the instance becomes the new coordinator.
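
As a rough illustration, here is a minimal sketch of that lease loop in Python. It is not production code: the in-memory LeaseStore below stands in for whatever shared resource you would actually use (a database row, a blob lease, etc.) and all of the names are invented for the example.

import threading
import time


class InMemoryLeaseStore:
    """Stand-in for a shared, lease-based mutex (e.g. a row in a database)."""

    def __init__(self, lease_seconds=10):
        self._lock = threading.Lock()
        self._owner = None
        self._expires_at = 0.0
        self._lease_seconds = lease_seconds

    def try_acquire(self, instance_id):
        """Acquire or renew the lease; returns True if instance_id now holds it."""
        with self._lock:
            now = time.time()
            if self._owner in (None, instance_id) or now > self._expires_at:
                self._owner = instance_id
                self._expires_at = now + self._lease_seconds
                return True
            return False


def run_instance(store, instance_id, poll_seconds=3):
    # Steps 1 to 5 above: race for the lease, do the work while we hold it, keep renewing.
    while True:
        if store.try_acquire(instance_id):
            print(instance_id, "is the coordinator; running background jobs")
            # ... run one batch of background jobs here ...
        else:
            print(instance_id, "is standing by")
        time.sleep(poll_seconds)


if __name__ == "__main__":
    store = InMemoryLeaseStore(lease_seconds=10)
    for name in ("instance-a", "instance-b", "instance-c"):
        threading.Thread(target=run_instance, args=(store, name), daemon=True).start()
    time.sleep(20)  # let the demo run for a short while

Note that the renewal interval (poll_seconds here) must be comfortably shorter than the lease duration, otherwise the coordinator role will bounce between instances even when nothing has failed.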


Note that the mutex becomes a potential single point of failure, so consideration should be given to the scenario where unavailability of the mutex prevents the cluster from electing a coordinator at all.

Another characteristic of using a shared mutex in this way is that election of the leader is non-deterministic. Any service instance in the cluster could take on the role of coordinator.

A good explanation of the shared mutex approach can be found in this article from MSDN.

Bully algorithm

There are some assumptions for the Bully Algorithm:

  • Each instance in the cluster has a unique identifier. This could be a process number or even a network address but, whatever it is, we must be able to order the instances in the cluster using this identifier.
  • Each instance knows the identifiers of the other instances that should be participating in the cluster (some may be dead for whatever reason).
  • Service instances don’t know which of the other instances are currently available and which are not.
  • Service instances must be able to send messages to each other.


The basis of the Bully Algorithm is that the service instance with the highest identifier will be the coordinator. The algorithm provides a mechanism for service instances to discover which of them has the highest identifier and for that instance to bully the others into submission by claiming the coordinator role. It follows this basic process:

  1. A service instance sends an ELECTION message to all instances with identifiers greater than its own and awaits responses.
  2. If no service instances respond the originator can conclude it has the highest identifier and is therefore safe to assume the role of coordinator. The instance sends a COORDINATOR message to all other instances announcing the fact. Other instances will then start to periodically check that the coordinator is still available. If it isn’t, the instance that finds the coordinator unavailable will start a new election (back to step 1).
  3. Any service instance receiving an ELECTION message and having an identifier greater than the originator will respond with an OK message indicating it’s available.
  4. If in response to an ELECTION message the originator receives an OK response back it knows there’s at least one service instance with a higher identifier than itself. The following then happens:
    1. The original service instance abandons the election (because it knows there’s at least one process with a higher identifier than itself).
    2. Any instances that responded to the ELECTION message with OK now issue ELECTION messages themselves (they start at step 1) and the process repeats until the service with the highest identifier has been elected.

A nice description of the process can be found in this article.
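
To make the message flow concrete, here is a compressed, single-process sketch in Python. Message passing is simulated with direct method calls, an alive flag stands in for a timed-out wait for OK responses, and the identifiers are invented for the example.

class Instance:
    def __init__(self, ident, cluster):
        self.ident = ident
        self.cluster = cluster      # shared dict of ident -> Instance
        self.alive = True
        self.coordinator = None

    def start_election(self):
        # Step 1: send ELECTION to every instance with a higher identifier.
        higher = [inst for ident, inst in self.cluster.items()
                  if ident > self.ident and inst.alive]
        responders = [inst for inst in higher if inst.on_election()]
        if not responders:
            # Step 2: nobody higher answered, so claim the coordinator role.
            self.announce_coordinator()
        # Step 4: otherwise a higher instance has taken over the election.

    def on_election(self):
        # Step 3: a higher instance responds with OK and runs its own election.
        self.start_election()
        return True

    def announce_coordinator(self):
        # COORDINATOR message to all instances (it may be announced more than once).
        for inst in self.cluster.values():
            if inst.alive:
                inst.coordinator = self.ident
        print(self.ident, "is the new coordinator")


cluster = {}
for ident in (1, 2, 3, 4, 5):
    cluster[ident] = Instance(ident, cluster)

cluster[5].alive = False     # the old coordinator (5) has crashed...
cluster[2].start_election()  # ...and instance 2 is the first to notice
# Instance 4 ends up announcing itself as the new coordinator.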

Ring algorithm

As with the Bully Algorithm there are some basic assumptions for the Ring Algorithm.

  • The service instances are ordered in some way.
  • Each service instance uses the ordering to know who its successor is (in fact it needs to know about all the instances in the ring, as we will see below).


The Ring Algorithm basically works like this:

  1. All service instances monitor the coordinator.
  2. If any service instance finds the coordinator is not available it sends an ELECTION message to its successor. If the successor is not available the message is sent to the next instance in the ring until an active one is found.
  3. Each service instance that receives the ELECTION message adds its identifier to the message and passes it on as in step 2.
  4. Eventually the message gets back to the originating service instance, which recognises the fact because its own identifier is in the list. It examines the list of active instances and finds the one with the highest identifier. The instance then issues a COORDINATOR message informing all the instances in the ring which one is now coordinator (the one with the highest identifier).
  5. The service instance with the highest identifier has now been elected as the coordinator and processing resumes.


Note that multiple instances could recognise that the coordinator is unavailable, resulting in multiple ELECTION and COORDINATOR messages being sent around the ring. This doesn’t matter; the result is the same.
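
Again, here is a minimal single-process sketch in Python to illustrate the flow; the identifiers, the ring order and the ‘crashed’ coordinator are all invented for the example.

class RingInstance:
    def __init__(self, ident):
        self.ident = ident
        self.alive = True
        self.coordinator = None


def next_alive(ring, index):
    # Step 2: find the next available instance after index, skipping dead ones.
    for step in range(1, len(ring) + 1):
        candidate = (index + step) % len(ring)
        if ring[candidate].alive:
            return candidate
    raise RuntimeError("no live instances left in the ring")


def run_election(ring, originator):
    # Steps 2 to 4: pass an ELECTION message round the ring, collecting identifiers.
    election = [ring[originator].ident]
    index = next_alive(ring, originator)
    while ring[index].ident not in election:   # stop when the message comes back round
        election.append(ring[index].ident)
        index = next_alive(ring, index)
    winner = max(election)                     # the highest identifier wins
    for inst in ring:                          # COORDINATOR message to all live instances
        if inst.alive:
            inst.coordinator = winner
    print("coordinator is now", winner, "- instances seen:", election)


ring = [RingInstance(i) for i in (3, 7, 1, 9, 4)]
ring[3].alive = False    # instance 9, the old coordinator, has crashed
run_election(ring, 0)    # instance 3 notices and starts the election
# Instance 7 is elected because 9 is unavailable.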

Other things to look at

A NuGet package is available for NanoCluster, a light-weight, non-intrusive leader election library for .Net. The source code is available on GitHub.

It’s a small project and doesn’t seem to have been used a great deal but might provide some ideas.


Friday, 12 August 2016

Problems installing Microsoft .Net Core 1.0.0 VS 2015 Tooling Preview 2


I had a funky installation of Visual Studio. Everything seemed to be working but when I came to install the Microsoft .Net Core 1.0.0 VS 2015 Tooling it refused to install:


Repeated attempts to repair or reinstall Visual Studio 2015 Enterprise weren’t getting anywhere and I wasn’t in a position to flatten the machine and start again. What was I to do?


I managed to run the installer and force it not to check Visual Studio using a command line switch:

ukafr02@UKSTOL0079 C:\Users\ukafr02\Downloads
$ DotNetCore.1.0.0-VS2015Tools.Preview2.exe SKIP_VSU_CHECK=1




Happy days!

Friday, 3 June 2016

Enter RSA passphrase once when using Git bash

When using Git bash it can become annoying if you have to enter your RSA passphrase every time you perform a remote operation. You can easily prevent this by using ssh-agent and running it when Git bash first runs.

Open a text editor and create a new text file. Paste the following bash script into the file:

#! /bin/bash 
eval `ssh-agent -s` 
ssh-add ~/.ssh/*_rsa

Note that my RSA key file is called id_rsa and is stored in the .ssh folder in my user’s home directory.

Save the file created above as .bashrc in your user’s home directory. Now, when you start a Git bash session you will be prompted for your passphrase once; it will then be cached for the duration of the session and you won’t have to enter it again.


See Saving an SSH key when using Git for info on generating the RSA key file.

Thursday, 2 June 2016

Git recipes

This post is a quick aide-mémoire for basic command-line Git operations. Well worth reading is the Git Getting Started documentation.

Clone a remote repository for the first time

To get started crack open Git bash and go to your source folder. You can then clone a remote repository into a new folder simply by running the following command:

someuser@mymachine MINGW64 ~
$ cd c:

someuser@mymachine MINGW64 /c
$ cd Source/

someuser@mymachine MINGW64 /c/Source
$ git clone git@some:repo/Some.Project.git
Cloning into 'Some.Project'...
Enter passphrase for key '/c/Users/someuser/.ssh/id_rsa':
remote: Counting objects: 1402, done.
remote: Compressing objects: 100% (1311/1311), done.
remote: Total 1402 (delta 1018), reused 87 (delta 59)Receiving objects:  87% (1220/1402), 580.00 KiB | 1.08 MiB/s

Receiving objects: 100% (1402/1402), 1.15 MiB | 1.08 MiB/s, done.
Resolving deltas: 100% (1018/1018), done.
Checking connectivity... done.

someuser@mymachine MINGW64 /c/Source

See the Git - git-clone Documentation.

List branches

The following command lists all local and remote branches:

someuser@mymachine MINGW64 /c/Source/Some.Project (master)
$ git branch -a
* master
  remotes/origin/HEAD -> origin/master

See Git – git-branch Documentation.

To see remote branches only use the -r switch. For local branches only use the -l switch.

someuser@mymachine MINGW64 /c/Source/Some.Project (master)
$ git branch -r
  origin/HEAD -> origin/master

someuser@mymachine MINGW64 /c/Source/Some.Project (master)
$ git branch -l
* master                                                           

Getting more information about the remote repository

someuser@mymachine MINGW64 /c/Source/Some.Project (master)
$ git remote -v
origin  git@some:repo/Some.Project.git (fetch)
origin  git@some:repo/Some.Project.git (push)

someuser@mymachine MINGW64 /c/Source/Some.Project (master)
$ git remote show

someuser@mymachine MINGW64 /c/Source/Some.Project (master)
$ git remote show origin
Enter passphrase for key '/c/Users/someuser/.ssh/id_rsa':
* remote origin
  Fetch URL: git@some:repo/Some.Project.git
  Push  URL: git@some:repo/Some.Project.git
  HEAD branch: master
  Remote branches:
    develop                          tracked
    feature/IntialLoggingIntegration tracked
    master                           tracked
  Local branch configured for 'git pull':
    master merges with remote master
  Local ref configured for 'git push':
    master pushes to master (up to date)

Checkout a remote branch

To switch to a new remote branch and check its status:

someuser@mymachine MINGW64 /c/Source/Some.Project (master)
$ git checkout develop
Branch develop set up to track remote branch develop from origin.
Switched to a new branch 'develop'

someuser@mymachine MINGW64 /c/Source/Some.Project (develop)
$ git status
On branch develop
Your branch is up-to-date with 'origin/develop'.
nothing to commit, working directory clean

Create a new branch

This example shows how to create a feature branch off develop:

someuser@mymachine MINGW64 /c/Source/Some.Project (develop)
$ git checkout -b feature/test develop
Switched to a new branch 'feature/test'

See the Git - git-checkout Documentation. Also see the Git Branching - Basic Branching and Merging.

Delete a branch

The following example shows how to delete a branch. Note you can’t be on the branch you’re deleting.

someuser@mymachine MINGW64 /c/Source/Some.Project (develop)
$ git branch -D feature/test
Deleted branch feature/test (was 2647fba).

See the Git - git-branch Documentation. Also see the Git Branching - Basic Branching and Merging.

Revert changes to a file

You can see which files have been modified locally using the “git status” command and then undo the changes with “git checkout”:

someuser@mymachine MINGW64 /c/Source/Some.Project (develop)
$ git status
On branch develop
Your branch is up-to-date with 'origin/develop'.
Changes not staged for commit:
  (use "git add ..." to update what will be committed)
  (use "git checkout -- ..." to discard changes in working directory)

        modified:   src/Some.Project/App.config
        modified:   src/Some.Other.Project/config/App.config

no changes added to commit (use "git add" and/or "git commit -a")

someuser@mymachine MINGW64 /c/Source/Some.Project (develop)
$ git checkout -- src/Some.Project/App.config

Merge the develop branch into a feature branch

If you are using Git Flow and you are working on a feature branch you might want to merge develop into your feature branch from time to time to minimise conflicts once your feature is complete. The basic commands to use would be:

someuser@mymachine MINGW64 /c/Source/Some.Project (develop)
$ git checkout feature/test
Switched to branch 'feature/test'
Your branch is up-to-date with 'origin/feature/test'.

someuser@mymachine MINGW64 /c/Source/Some.Project (feature/test)
$ git merge develop

See the Git – git-merge Documentation.

Checking to see what might be committed (dry-run)

An easy one - how can you see what will be committed without actually doing so? Like this:

someuser@mymachine MINGW64 /c/Source/Some.Project (develop)
$ git commit --dry-run

See the Git – git-commit Documentation.

Delete a local branch

someuser@mymachine MINGW64 /c/Source/Some.Project (develop)
$ git branch -d SomeBranchName
Deleted branch SomeBranchName (was 76e05d9).

See the Git – git-branch Documentation.

Thursday, 19 May 2016

Diagnosing log4Net issues in an ASP.Net web application

The Problem

I was trying to get a legacy web application up-and-running on a development workstation but log4Net was not generating any log files. The application was being run from a local instance of IIS 7.5 and built using Visual Studio 2013.

Stepping through the code did not reveal anything relevant and the code to initialise log4Net was being called. So, the problem was how to see what log4Net was doing internally to identify any errors during initialisation.

The Solution

The solution was to enable log4Net’s internal debugging. To do so I modified the web.config file to include a couple of new entries. Firstly, a new appSettings entry:

    <!-- other settings omitted -->
    <add value="true" key="log4net.Internal.Debug" />

Then add a new trace listener to system.diagnostics:

    <trace autoflush="true">
        <listeners>
            <add name="textWriterTraceListener" type="System.Diagnostics.TextWriterTraceListener" initializeData="C:\log4net.txt" />
        </listeners>
    </trace>

Starting the application now created a log4net.txt file in the root of my C-drive. A quick scan of that log file revealed the issue:

log4net:ERROR [RollingFileAppender] ErrorCode: GenericFailure. Unable to acquire lock on file D:\applog\_root_20160518.0.log. The device is not ready.

As it happens there's no D-drive on my machine! Further investigation revealed that the machine image used to build my workstation included a machine-level web.config file that contained the offending log file location (yes, horrible I know). I changed the entry, restarted the web application and log4Net started working.


Monday, 16 May 2016

Creating a SQL Server alias

The Problem

I have a laptop with SQL Server Express 2012 installed and configured as a named instance (localhost\SQLEXPRESS2012). I’ve cloned a code repository and want to be able to run the applications it contains but they are all configured with connection strings that look for a different instance name (localhost\dev_2012).

I could start modifying connection strings but there are multiple applications and therefore connection strings to modify. I’d prefer to be able to create an alias to the database that matches the one in the configuration files so I don’t need to modify them at all.
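
For illustration, the connection strings in question look something along these lines (the database and setting names here are invented); the Data Source already refers to localhost\dev_2012 and that is the part I don’t want to touch:

    <connectionStrings>
      <add name="SomeDatabase"
           connectionString="Data Source=localhost\dev_2012;Initial Catalog=SomeDatabase;Integrated Security=True"
           providerName="System.Data.SqlClient" />
    </connectionStrings>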

The Solution

The solution is to create an alias to the named instance using the SQL Server Configuration Manager.

Open the SQL Server Configuration Manager (Start Menu > All Programs > Microsoft SQL Server 2012 > Configuration Tools).

Check that TCP/IP is enabled for the instance you are creating an alias for. Enable it if it is not.


Once TCP/IP is enabled we can create an alias to the instance for the SQL Native Client. In my case this was for the 32-bit version. Expand the SQL Native Client 11.0 Configuration element and right-click on Aliases. Select New Alias… from the context menu.

Use the new instance name as the Alias Name and set the Server value to the original named instance.

[Screenshot: localhost\dev_2012 alias properties]

Note the Port No field. By default SQL Server uses 1433 but you can check your setup using the SQL Server Configuration Manager. Open the SQL Server Network Configuration element again and select the protocols for the named instance. Right-click on TCP/IP and view the Properties.

[Screenshot: TCP/IP properties – Protocol tab]

If Listen All is enabled on the Protocol tab move to the IP Addresses tab and scroll down to the IPAll section. If Listen All is not enabled you will need to look for the appropriate IP section.

If SQL Server is configured to use a static port it will appear as the TCP Port value. If it is configured to use a dynamic port the port in use will appear as the TCP Dynamic Ports value. In either case this is the port number to use when creating the alias.

[Screenshot: TCP/IP properties – IP Addresses tab]

Click OK to close any open dialog. Restart SQL Server.

The new instance name will now get routed to the actual instance (calls to localhost\dev_2012 will get routed to localhost\SQLEXPRESS2012 in this case).

You can check everything works by connecting the SQL Server Management Studio to the new instance name.

Tuesday, 12 April 2016

UKMail customer service – FAIL!

We are living the dream. As a family we take full advantage of online retailing but there’s one aspect of the process that seems to be in need of improvement: delivery. 

A company that I’ve had a few unsatisfactory dealings with recently is UKMail. I’ve had disappearing deliveries where either they didn’t try to deliver or – more likely – they did but the driver decided not to leave a card.

Phone system – FAIL

On one such occasion I tried calling by phone and got bounced through the usual impenetrable audio menus until I encountered an option to arrange for a re-delivery. I dutifully pressed 1 as instructed only to be told “Thank you for calling UKMail. Goodbye.” after which I was immediately cut off.


Happy? No. Really, no.

Contact Us – FAIL

Having tried and failed to use their phone system I decided to raise a complaint using the online Contact Us form. It seems UKMail were ahead of me there, having created a form that you can’t actually submit.


Try as I might I couldn’t identify what the erroneous character was. I suspect there’s a proud customer services manager gleefully including in his weekly report the fact that no one is complaining via the website. Now we know why.

I pointed this out on Twitter but I think they missed the point. The following Tweet did result in contact from customer services but only to try and rearrange delivery of the parcel, which I’d already managed to do.


Delivery notification – FAIL

So today, I get home to find yet another UKMail card lying on the door mat but this time annotated by a clearly irritated UKMail delivery man.


Loving the “Again”. It didn’t irritate the hell out of me at all.

It might be stating the obvious but if you keep trying to deliver at the same times and there’s never anyone in then you might be trying at the wrong times.

But let’s look a bit closer. This card suggests we’ve been notified in advance in order to give us the chance to choose “option 1”. So I checked my email and this is what I found.


This email arrived at 09:42hrs on the day of delivery, only 2 hours ahead of the earliest delivery time given.

What do UKMail expect here? Do they expect us all to be monitoring personal email while at work and filling in online forms to arrange delivery at another time? I’m pretty sure my boss wouldn’t be too happy about that.

UKMail, if you insist on sending these emails at least give us a chance to answer them. Two hours isn’t enough notice. And perhaps point this out to your delivery men so they don’t get snippy on your cards.

Sorry we missed you email – FAIL

This one speaks for itself.


Option 1 – FAIL

OK, so the card says quite clearly to visit the website and select ‘Manage My Delivery’. I did just that, entered the card number and postcode as directed and ended up here.


Can’t see an option 1, 2, 3 or 4 there… Definitely can’t see a “Leave in a safe place” option… Not sure what to do now.

I give up. I think I’ll be looking out for UKMail as a delivery option when making online purchases and selecting something else!

Collect from depot – FAIL

OK, let’s try the Collect from depot option.


Right. No idea what times I can collect then. Not even a default “between 9am and 5pm”. Remember I’m working so what are the chances of them being open after I finish work (i.e. after 5pm)? Nil, I suspect. Do I risk it..?

Well I do as it happens and I end up with this:


They don’t like times at UKMail do they – unless it’s giving you 2 hours’ notice of a delivery.


As a consumer I very often don’t even know which delivery company will deliver any given online purchase. Even if I do, it’s usually the case that I don’t get the chance to specify delivery options such as ‘leave in a safe place’. That seems like a failing to me.

If you leave instructions on a card make sure those instructions can actually be completed by the customer.

Notifications of impending deliveries 2 hours beforehand are a waste of time when people can be expected to be at work.

And why try to deliver during the day at all? Surely most people are out at that time, at least as far as domestic customers are concerned. Wouldn’t evening deliveries be less wasteful in time and resources, not to mention creating better customer relationships?

Online forms that are difficult to submit will aggravate end users, especially when they are already aggravated. Emails that refer to buttons that aren’t there are plain sloppy.

Any individual item given above wouldn’t mean much on its own but, when combined, they result in reduced customer satisfaction and loss of confidence in the service.

Now, let’s try and rearrange delivery of that parcel…

Update – 13/04/2016

After contacting UKMail via Twitter I received the following message:


So even though I’ve said it’s OK to leave the parcel in a safe place they won’t do it. This illustrates the problem with deliveries and online purchasing. If I as the consumer am not able to specify delivery options like this at the point of sale and if delivery companies won’t allow deliveries outside normal working hours it makes life difficult for all concerned.

It also occurred to me that I hadn’t shown what happens if you try to get your parcel redelivered.


You can’t specify a time of delivery, not even morning or afternoon let alone after 5pm. What are the options here? Only one: take the day off and sit around all day waiting for a delivery. For large or expensive deliveries that might be a viable option but for most of what I do it’s not.

Update – 14/04/2016

As you can see from the screenshots in the Collect from depot – FAIL section above I used the online system to arrange collection from the depot. Imagine my surprise to be sent the following email:


As usual the email arrived just 2 hours before the scheduled delivery time (at 10:03hrs actually). So, UKMail have ignored my instruction that I’d be picking the parcel up from the depot and I can expect another crappy card with some suitably irksome message from an irritated delivery driver waiting for me at home.

Just pathetic.

And will the parcel be waiting for me at the depot? Should I waste my time trying to pick it up as I’d arranged?

Update - 15/04/2016

I don’t know why I bothered but I tried collecting the parcel as previously arranged. No surprises here – the result was a big fat “Sorry mate. Bad news I’m afraid…” from the UKMail man in reception. It seems the parcel was still on the van and wouldn’t be back at the depot until 7pm.

So I had to make another journey to collect the parcel after 7pm taking an hour out of my evening. I have the parcel now but that’s it for me. If an online vendor owns up to using UKMail I’m shopping elsewhere.