Why Distributed Mutexes?

When multiple threads of execution are trying to modify a single shared resource - be it a file or the result of a computation - consistency can be compromised unless proper synchronization of the threads is guaranteed. A mutex is a device that guarantees mutual exclusion - that only one thread can touch a shared resource at a time, thereby protecting the shared state from race conditions.

If the two threads of execution happen to run in different containers, the problem is only made worse by the erratic nature of computer networks interconnecting them.

In an ideal world, the solution would be simple and we wouldn’t need distributed locking at all - everything could be synchronized implicitly using distributed queues. But, in the real world where you have to support an existing code-base, using an explicit mutex is a simple and reliable way to avoid races.

The Current State and Reasons to Change It

We’ve used mutexes before: Our current solution uses Redis via the redis-mutex gem. This solution actually works most of the time, but there are some issues and we were motivated to look into other solutions:

  1. We’ve long wanted to get rid of Redis. It presents an unnecessary dependency for us, because we also use ETCD for service discovery. Since ETCD supports locking for some time and since we only use Redis for locking, we have decided to simplify our tech stack, cut down our infrastructure costs and the amount of maintenance work.

  2. We’ve seen Redis misbehave before. Here’s what it looks like when it runs out of memory and can’t allocate new memory to fix the issue:
    [565] 06 Sep 13:44:51.044 # Can't save in background: fork: Cannot allocate memory
    [565] 06 Sep 13:44:57.048 * 1 changes in 900 seconds. Saving...
    [565] 06 Sep 13:44:57.048 # Can't save in background: fork: Cannot allocate memory
    [565] 06 Sep 13:45:03.066 * 1 changes in 900 seconds. Saving...
    [565] 06 Sep 13:45:03.066 # Can't save in background: fork: Cannot allocate memory
    [565] 06 Sep 13:45:09.094 * 1 changes in 900 seconds. Saving...
    [565] 06 Sep 13:45:09.094 # Can't save in background: fork: Cannot allocate memory
    [565] 06 Sep 13:45:15.003 * 1 changes in 900 seconds. Saving...
    [565] 06 Sep 13:45:15.005 # Can't save in background: fork: Cannot allocate memory
    [565] 06 Sep 13:45:21.021 * 1 changes in 900 seconds. Saving…
    

    But it needed new memory to create snapshot before we could delete some keys.

    127.0.0.1:6379> DEL foobar
    (error) MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.
    

    We were able to resolve the issue by disabling the snapshots, deleting some data, and then re-enabling the snapshots. While this issue is known to people using COW filesystems (btrfs, for example), it still was quite unpleasant and surprising coming from Redis.

  3. This goes hand-in-hand with master/master replication not really being a thing that actually works in the land of Redis.

  4. As for the gem itself, when redis-mutex cannot acquire a lock (e.g. because the lock is already held by someone else), it has an option for waiting for a certain amount of time for the lock to be released. This is a handy feature, but implementation-wise, it uses polling in configurable intervals (so it’s basically busy-waiting for the lock to become available). With many concurrent locking attempts, this seemingly-innocent approach leads to substantial network traffic. Also, it’s an ugly solution. (But since Redis does not provide state change notifications, we cannot think of a better one either.)

Considering New Solutions

Before we get into the details of what we picked as a replacement (spoiler alert, it’s ETCDv3), let’s talk about another candidate: postgresql.

Postgresql provides two mechanisms for locking - advisory locks and SELECT FOR UPDATE. Ignoring, for the moment, that not all our services are using a database, it’s still a no-go. As far as we know, locks are not part of WAL, so they are not synchronized to replicas. Since we’re using Patroni for seamless database failover, this is a serious problem.

What’s more, session-level locks don’t play very well with pgbouncer in transaction pooling mode, and transaction-level locks are hard to use when your ORM creates implicit transactions in several locations. Also, nesting them is just a bit weird.

An important part of the decision process concerned timeouts. We want to be able to tell the lock how long it could wait to acquire the lock. That is not too easy to do with Postgresql.

Why (and how) ETCD3

Now let’s talk a bit about why we picked ETCD3.

  1. We already have it running and configured in HA setup.
  2. It’s really f*ckin’ fast.
  3. It’s easy to work with in many languages

This is where the real story begins. We thought that a simple gem to acquire an ETCD lock could be pulled in a day or so and could see proof-of-concept implementation in a few services by the end of the week. Well, we were really wrong about that and we needed to step back and think this over - twice - to get it right:

First take: ETCDv2 HTTP API

We wanted to start with something simple, so picking ETCDv2 HTTP API and interacting with the key store using simple REST API seemed like the right thing to do, and in line with our goal of creating something simple. And, if I’m being honest, we hadn’t done proper research in advance - so, among other things, we didn’t realize that ETCDv3 is completely different from the previous major version.

Thanks to the existence of the prevExist query argument in the ETCDv2 RESTful API, locking can be implemented easily. Here is part of the original class that was doing the locking:

private def http_client
  if not @http_client
    @http_client = Net::HTTP.new('127.0.0.1', 2379)
    @http_client.open_timeout = OPEN_TIMEOUT_SEC
    @http_client.read_timeout = READ_TIMEOUT_SEC
    @http_client.ssl_timeout = SSL_TIMEOUT_SEC
  end
  return @http_client
end

def try_lock
  resp = http_client.send_request('PUT', "/v2/keys/#{lock_name}?prevExist=false")
  case resp
  when Net::HTTPSuccess
    return true
  when Net::HTTPPreconditionFailed
    return false
  else
    raise InvalidLockStateException
  end
rescue
  raise CommunicationError
end

Notice setting the timeouts in the first method, the call in the second, and recall that we want to limit how long it waits for the lock at max. Do you see the issue? Well, we didn’t at first - but while we tested our code, we soon realized what was happening.

The issue is that Net::HTTP always does one retry when the HTTP method used is idempotent (which PUT is). This may come handy at times, but it’s unfortunate in this case, because now we are no longer able to guarantee that locking will either succeed or fail in the requested time.

Since 2.5.0 there is #max_retries= but we still use some pre-2.5.0, so this was not an option.

This is definitely solvable, either by monkey-patching Net::HTTP or by using a different http client. But, while browsing ETCD documentation we noticed that ETCDv3 has native locking support! And, it’s using grpc instead of http protocol, so the retries issue would (hopefully) be solved as well.

So, with great enthusiasm, we decided to switch to ETCDv3 native locking instead, and that definitely seemed like the proper way to go.

Second take: ETCDv3 native locks

We need distributed locks and ETCD (since version 3.2) comes with native locks included. Wow, that means everything is perfect and the job is almost done, right? Well, not really.

To communicate with ETCD, we decided to use the etcdv3-ruby gem. It nicely abstracts away lots of implementation details and gives a nice API:

require 'ETCDv3'
conn = Etcdv3.new(endpoints: 'https://hostname:port')
conn.put('foo', 'bar')
conn.get('my')

Reasonably easy to work with. However, the gem had no support for the locks at that time. Despite that, all the foundation was there, it’s open source (under MIT even!), so we could just add the support ourselves.

That actually proved more challenging than expected. Here you can see what was merged in the end. 21 files changed across 15 commits; +303, -26. That merge commit both fixed multiple bugs and added new features.

In one go, we managed to add support for native locks, support for Ruby 2.5.0+, and improve Travis tests (now it tests a matrix of various Ruby and ETCD versions). Having done all that, why was this not the final version? And, why aren’t the locks production-ready yet? The reason we got into trouble was logging.

In a distributed environment, it’s all about logging. Our platform is a big machine, with many different parts doing many different things. When something goes awry, logs are often the difference between being out of business and a quick hotfix that no one even notices. For proper logging, you need to attach little labels to locks – e.g. by whom, why, and when was it locked, how long they expect to hold it, and so on. You need all that for alerting and monitoring, and just to have a general idea about what’s happening in the machine.

But, that’s all quite difficult with ETCD native locks. You basically get an atomic check if the key exists, and a lock if it does not. There are some limitations, however. For example, each lock needs to be attached to a lease (check Lease API), and that means that every lock is inherently auto-expiring. Sure, you can set the expiration to ten years, but that’s far from optimal.

Whether or not locks should auto-expire is a topic all its own.

On one hand, you need them to expire so that the platform can auto-heal should a lock stay locked for no reason and prevent other workers from doing their jobs. And if the owner of the lock dies, you need the lock to expire quickly.

On the other hand, locks must never expire too early - that could easily lead to a race condition. That’s what happened sometimes with the Redis locks: they silently expired, yet the original lock of the owner was still alive and touching some shared state. It didn’t really happen very often that locks would expire and cause problems for us - but if it can happen, you must account for it. When something weird happened and locks were involved, you just had to think about the possibility of some lock silently expiring somewhere. This makes it really hard to reason about programs, and the distributed environment in which they’re running only makes it harder.

We were annoyed by the fact that we couldn’t store the metadata during the locking process. All you can do is lock the lock and then write the metadata. What should you do if you manage to lock the lock but fail to write the metadata? That may render alerting nonfunctional, because now your lock is missing crucial information, such as when it was locked and who locked it in the first place. Also, the ETCD calls have some network overhead, and having three calls instead of one (one for the lease, one for the lock, and one for all the metadata) makes the latency noticeable.

Another issue is that the lease API sucks. Given that you can only give it TTL - the advisory time-to-live in seconds - there is no guarantee it will actually be honored. You can go ahead and ask for a 10 year lease, but you may not get it. It’s likely that there’s a reason for that, but it’s not making things any easier. Instead, what you need to do is get a “regular” lease (couple of minutes) and refresh the lease periodically.

The problem is, when you fail to refresh the lease (e.g. because some operation suddenly takes 30 minutes instead of five), the lock is immediately and silently unlocked.

Who could have known all that? We got rid of our original “lame” implementation in favor of the “proper” one, only to find out that it lacks many essential properties. That’s when we decided to take a fresh look…one last time.

Take 3: Transactions

Let’s try to think about this once more, this time on more elemental level. What do we actually want to accomplish? We need to have a way to atomically:

  • check if a key exists
  • if not, write the “lock key” and the metadata
  • otherwise, the lock is held and we need to wait, but busy waiting is not acceptable

Hmm, seems doable. ETCD has a cool transactions feature, which allows us to do the “check and write” part, and the API to use it from Ruby is pretty nice. Let’s borrow from ETCDv3’s README again:

# https://github.com/davissp14/ETCDv3-ruby/blob/master/lib/ETCDv3/kv/transaction.rb
conn.transaction do |txn|
  txn.compare = [
    # Is the value of 'target_key' equal to 'compare_value'
    txn.value('target_key', :equal, 'compare_value'),
    # Is the version of 'target_key' greater than 10
    txn.version('target_key', :greater, 10)
  ]

  txn.success = [
    txn.put('txn1', 'success')
  ]

  txn.failure = [
    txn.put('txn1', 'failed', lease: lease_id)
  ]
end

This could work well! Checking that a key does not exist can be done by comparing its version field with 0:

def keyname(metadata_name:)
  "/mutexes/#{@lock_name}/#{metadata_name}"
end
# ...
conn.transaction do |txn|
  txn.compare = [
  # if the version of the key is zero, the key does not exist
  #   => lock not held by anyone
    txn.version(keyname(metadata_name: :uuid), :equal, 0),
  ]

  txn.success = [
    txn.put(keyname(metadata_name: :uuid), @uuid),
    txn.put(keyname(metadata_name: :hostname), @hostname),
    txn.put(keyname(metadata_name: :locked_at), Time.now.utc.to_i,
  ]
end

(Note: The code was simplified to be easier to follow.)

We use the uuid subkey instead of a lock name so we can make sure that any instance will not unlock someone else’s lock by mistake:

def unlock!(timeout: @timeout)
  resp = @client.transaction(timeout: timeout) do |txn|
    txn.compare = [
      txn.value(keyname(metadata_name: :uuid), :equal, @uuid),
    ]

    txn.success = [
      txn.del(:uuid),
      txn.del(:hostname),
      txn.del(:locked_at),
    ]
  end

  raise LockNotLockedError unless resp.succeeded
rescue GRPC::DeadlineExceeded
  raise UnlockTimeout
end

Here you can even see the timeouts. So what’s left to solve? When a lock is locked, we need to do a retry. But, once again, we don’t think that polling is an acceptable solution. Luckily, ETCD provides an easy-to-use watch API:

events = conn.watch('foo', timeout: 10)

(The timeout argument was not supported - that’s another thing we’ve added.)

Now that we have that, cool locks can be implemented along the lines of:

# Returns how much remains of the +deadline+ time given. If +deadline+ is
# in past, +0+ is returned.
private def remains(deadline)
  [0, deadline - Time.now].max
end

def lock!(timeout: @timeout)
  deadline = Time.now + timeout
  loop do
    resp = try_lock!(timeout: remains(deadline))
    break if resp.succeeded

    @client.watch(
      keyname(metadata_name: :uuid),
      start_revision: resp.header.revision,
      timeout: remains(deadline)
    )
  end
rescue GRPC::DeadlineExceeded
  raise LockTimeout
end

This uses a start_revision parameter so we get any updates to key keyname(metadata_name: :uuid) - which happened after we tried to write it (because resp.header.revision returns keystore version at that moment). As the only changes to :uuid subkey are deletes, we know that someone released the lock and we can try to lock it again.

remains is just a handy function to calculate for how much longer we should

Wait.

And there you have it - a simple, reliable, distributed locking mechanism. With some nice wrappers around it, usage is simple:

SM::ClusterMutex.setup(host_list: %w{
  http://example1.org:2379
  http://example2.org:2379
})

SM::ClusterMutex.with_lock("lock1") do
  # TODO: do something
end

SM::ClusterMutex.with_lock("lock2") do |mtx|
  # This loop processes a lot of items
  xx.each do |i|
    # TODO: do something with item +i+
    # Signal that we are still alive, so that the lock does not expire
    mtx.alive!
  end
end

Lock Expiration

Now we have reliable locks, but they still don’t expire. The good news is that we found a really cool way to solve this - we’ve decided to create a separate unlocker microservice.

This has several advantages: first off, you can alert before you unlock. One of the lock’s parameters is alert_at, a UTC timestamp which is always several minutes in the future. A call to alive! on the lock pushes the stamp further. When you lock the lock, you state your expected runtime (e.g. 300 for 5 minutes worth of work) and then, if you don’t call alive! in that interval, the unlocker service will start to alert you that something’s probably wrong.

(Well, that’s not completely true. The service merely provides a /metrics collection end-point with Prometheus metrics, which are alerted upon. For these alerts we have Icinga checks, and it all meshes nicely with the rest of the alerting machinery.)

OK, so we have alerts. But when will the locks actually expire? There’s another key called auto_unlock_at. If you fail to push this timestamp forward in time, you will, as the owner of the lock, be considered dead and the unlocker service will remove the lock on your behalf. And yes, the monitoring machinery will let you know that something bad has happened.

This approach has yet another advantage. Some of our services are written in Python and we expect to switch them to the new locking approach soon. Separating the unlocking functionality into a shared service makes it much simpler to write the locking library. It basically boils down to writing some keys to ETCD and handling some errors, and all of that is language-specific and cannot be shared anyway.

Monitoring

With the unlocker service, the solution meets all our requirements with one minor exception - it’s still hard to see which service holds which locks and for how long. But, with all that metadata, it shouldn’t be too hard to implement this, right?

Right. We wrote yet another service called mutexdash (for Mutexes Dashboard) which shows all locks, who holds them, what they are used for, and even the backtrace at the time of the call to lock! - that’s a cool feature that allows you to get to the proper code quickly.

The End of the Road

Now the journey is almost over, so what did we end up with? First the obvious question, what is the impact on performance? Did it get faster? Did it get slower? As always, benchmarking is the only way to answer this question.

n = 100
Benchmark.bm do |x|
  x.report("Redis") { n.times { RedisMutex.with_lock(:foo_bar_dummy_bm) {} } }
  x.report("Etcd3") { n.times { SM::ClusterMutex.with_lock("foo_bar_dummy_bm") {} } }
end

This will 100x lock and unlock the key and time it. It was executed in a docker container running in our staging environment. The ping to the Redis master is around 0.3ms and to ETCD3 cluster around 0.4ms.

/tmp# ./bm.rb
       user     system      total        real
Redis  0.020000   0.000000   0.020000 (  0.225834)
Etcd3  0.200000   0.160000   0.360000 (  0.624115)

So, it’s 2.7x slower, but that was partially expected. I did not expect it to be that much slower, but it makes sense. We have to take the differences in architectures between Redis and ETCD into account. Redis has master-slave architecture, which means that almost everything can be done in-memory on the master. Etcd is based around raft protocol, which is a consensus algorithm. That means the machines in the cluster needs to talk to each other, creating much more overhead. But, ignoring percents and looking at actual numbers, we see that it’s 2.2ms vs 6.2ms per lock, both of which are more than ok for a distributed locking mechanism going over network. (But you have to watch your latencies and jitter!)

Performance-wise it’s not exactly a win. But it’s not all a speed game. Where ETCD is clearly superior is reliability. Since all nodes are basically equal and using raft, as long as we can reach even single one of them, we can at least try to do a lock. Worst case scenario is that the locking will fail, but it will behave correctly even facing various network splits and other unlikely-but-possible scenarios.

What a journey! It took us three shots to finally hit the target, but it was worth it - we have a reliable locking scheme which is simple to use, easy to monitor, and allows for the safest possible unlock we can think of. While the performance is a bit worse than before, it is more resilient and scalable - and that’s a trade-off we’re comfortable with.

  • TODO: opensource unlocker
  • TODO: opensource mutexdash
  • TODO: opensource sm_cluster_mutex

Please check the original version of this article at