Background

We use the Celery distributed task queue library at work, which is great for running asynchronous tasks across multiple processes and servers. Celery has both user-initiated and periodic (think cron replacement) tasks, and we have found in practice that the system distributes tasks quite nicely across our farm of celery servers.

One issue we have is that for several of our periodic tasks, we need to ensure that only one task is running at a time, and that later instances of the same periodic task are skipped if a previous incarnation is still running.

The Celery documentation has a cookbook recipe for this scenario: “Ensuring a task is only executed one at a time”. The crux of the solution is to implement a distributed lock using the Django cache (memcached in the example) with the following lambdas:

lock_id = "something unique"
lock_expire = 60 * 5  # five minutes

acquire_lock = lambda: cache.add(lock_id, "true", lock_expire)
release_lock = lambda: cache.delete(lock_id)
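
In a periodic task, these helpers wrap the task body: acquire the lock up front, skip the run if it is already held, and release it in a finally block. Here is a minimal sketch of that pattern, assuming Celery's shared_task decorator; the task name and body are hypothetical stand-ins:

from celery import shared_task
from django.core.cache import cache

LOCK_EXPIRE = 60 * 5  # five minutes, as in the recipe

@shared_task
def my_periodic_task():
    lock_id = "my-periodic-task-lock"  # hypothetical lock name
    # cache.add returns False if the key already exists, so a
    # still-running previous instance causes this run to be skipped.
    if not cache.add(lock_id, "true", LOCK_EXPIRE):
        return "skipped: previous instance still running"
    try:
        pass  # ... the actual periodic work goes here ...
    finally:
        # Always release, so the next scheduled run is not blocked
        # until the cache timeout expires.
        cache.delete(lock_id)

Releasing in the finally block matters: if the task body raises an exception, the lock would otherwise stay held until lock_expire passes.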

Non-persistent Locks?

This approach works fine as long as the cache is shared across all Celery worker nodes and is persistent. However, if memcached (or another non-persistent cache) is used and (1) the cache daemon crashes or (2) the cache key is evicted before the lock expires or is released, then you have a race condition in which two or more tasks can acquire the lock simultaneously. This “distributed cache lock” approach has been discussed in various posts, which all acknowledge the danger of relying on memcached for persistent data.

Distributed Locks with Redis

As noted by one of the posts above, the simplest solution to this problem, if you like using the cache for distributed locks, is to switch to memcachedb, which is not a caching solution per se but rather a persistent key-value store that implements the memcached interface.
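
If you would rather move to Redis, the same acquire/release lambdas translate almost directly. The sketch below is an illustration under stated assumptions, not a drop-in from the recipe: it presumes a Redis server on localhost and the redis-py client, whose SET command accepts nx (set only if absent) and ex (expiration in seconds) in a single atomic call; the lock key is the same placeholder as before:

import redis

r = redis.Redis(host="localhost", port=6379)  # assumed local Redis server

lock_id = "something unique"
lock_expire = 60 * 5  # five minutes

# SET with nx=True only succeeds if the key is absent (like cache.add),
# and ex= attaches the expiration atomically in the same command.
acquire_lock = lambda: r.set(lock_id, "true", nx=True, ex=lock_expire)
release_lock = lambda: r.delete(lock_id)

With persistence enabled (AOF or RDB snapshots), the lock key can survive a daemon restart, which closes the crash window described above.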