Loose Bits
Thoughts on distributed systems, cloud computing, and the intersection of law and technology.

Rackspace Cloud Files and Pseudo-Directories

Cloud Providers and Listing Storage Objects

Rackspace’s Cloud Files provides a basic distributed blob / file abstraction. Like Amazon Web Services’ Simple Storage Service (AWS S3) and Microsoft Azure Blobs, Cloud Files is organized into a two-level hierarchy consisting of:

  • Containers: A collection of file objects. In AWS S3 parlance this is called a “bucket”. For all three cloud providers, a container is a single collection unit that cannot be further subdivided.
  • Storage Objects: A single file object. In AWS S3 this is an “object”, while in Azure it is a “blob”. For all three providers, a storage object similarly cannot be further nested or subdivided.

Thus, there are no levels of hierarchy beyond container and object, and for all three providers the storage object namespace is completely flat.

As most of the world is used to nested file hierarchies much deeper than this, cloud providers allow wide leeway in naming storage objects, most notably allowing characters like a slash (“/”). To provide the illusion of a nested hierarchy, AWS S3 and Azure Blobs provide listing operations that take a delimiter character to treat as a nested directory delimiter. In this manner, calls to the list objects API will return results as if there are intermediate directories in the (really) flat storage object namespace.
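As a concrete illustration (the post itself does not include code), a delimiter listing against S3 using the present-day boto3 client looks roughly like this; the bucket and prefix names are made up:

import boto3

s3 = boto3.client("s3")

# List what is "under" patents/ in the flat key namespace, treating "/" as a
# directory separator. Keys sharing a deeper common prefix are rolled up into
# CommonPrefixes, which play the role of subdirectories.
resp = s3.list_objects_v2(Bucket="example-bucket", Prefix="patents/", Delimiter="/")

for common in resp.get("CommonPrefixes", []):
    print("pseudo-directory:", common["Prefix"])   # e.g. patents/2010/

for obj in resp.get("Contents", []):
    print("object:", obj["Key"])                   # objects directly under patents/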

The Old Way - Rackspace Cloud Files and Dummy Directory Objects

Much to our frustration, Rackspace Cloud Files did not originally provide a delimiter query API for treating a chosen character as a nested directory separator. Instead of delimiter queries, Rackspace required that clients upload a storage object of type “application/directory” at each level at which a nested pseudo-directory was desired. Only then would the listing results start to resemble those from AWS / Azure delimiter queries.
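For illustration, creating such a dummy directory marker against the Cloud Files REST API amounts to a zero-byte PUT with the application/directory content type. The storage URL, token, container, and directory names below are placeholders, not values from the post:

import requests

# Placeholder values; in practice these come back from the Cloud Files auth call.
storage_url = "https://storage.clouddrive.example/v1/MossoCloudFS_account"
auth_token = "your-auth-token"
container = "documents"

# One zero-byte "application/directory" object per pseudo-directory level.
for pseudo_dir in ("patents", "patents/2010"):
    resp = requests.put(
        "%s/%s/%s" % (storage_url, container, pseudo_dir),
        headers={
            "X-Auth-Token": auth_token,
            "Content-Type": "application/directory",
            "Content-Length": "0",
        },
        data=b"",
    )
    resp.raise_for_status()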

Read more...

Celery Logging with Python Logging Handlers

Logging in Celery

We use Celery as our backend messaging abstraction at work, with lots of disparate nodes across different development, test, and production deployments. As each system deployment now contains a large (and growing) number of nodes, we have been making a heavy push towards consolidated logging through a central logging server sink, using syslog (specifically rsyslog).

So, ideally, we would just use Celery’s configuration to specify a syslog handler (or maybe use a pipe). Unfortunately, it seems there is just not a simple way of doing this straight from Celery, as the only logging configuration parameters (with possible values) are:

CELERYD_LOG_FILE = "/path/to/file.log"  # File logging.
                                        # (OR)
CELERYD_LOG_FILE = None                 # stderr.

which means you either get a file logger (logging.FileHandler) or a stderr logger (logging.StreamHandler).
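What we would really like is to attach an arbitrary handler from the standard library, such as logging.handlers.SysLogHandler. By hand, outside of Celery, that looks roughly like this; the logger name, syslog address, and format string are illustrative:

import logging
import logging.handlers

# Illustrative only: the hard part is getting Celery to attach this handler
# to its own loggers, not constructing the handler itself.
logger = logging.getLogger("celery")

# Send records to the local syslog socket; use ("loghost", 514) for a remote
# rsyslog sink instead.
handler = logging.handlers.SysLogHandler(address="/dev/log")
handler.setFormatter(logging.Formatter("celeryd: [%(levelname)s] %(message)s"))
logger.addHandler(handler)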

Hints and guidance on the web about getting arbitrary logging handlers (or syslog specifically) into Celery are sparse as well as noisy. A lot of the code and discussion I found used Celery as the backend for a generic logging framework or library, but I found very little on getting Celery tasks and processes to actually log to syslog.

Bringing Arbitrary Logging to Celery

As nothing magically appeared as the “right” solution, I considered a couple of different ways to hook things up:

  1. Watched Files: Have Celery log to a file per usual, then add extra configuration and scripts to watch the file for changes and submit the changes to syslog directly. I didn’t go with this approach, as I really prefer to have configuration for our project within generic Python settings, and not need extra, system-specific scripts and setup.
  2. Patch Celery: The Celery logging hooks and code are fairly straightforward. I could have written the patch and submitted it upstream. Unfortunately, for the project at work, we are moving too fast to wait for changes, and as we are looking forward to some upstream Celery releases, I’d rather not maintain a private custom patch set through all of that.
  3. Monkey Patch Celery: In the same vein as the previous option, the same hooks and changes could be applied as a monkey patch instead. This is what I eventually went with.

Patching Logging Handlers into Celery

Monkey patching is an oft-controversial and generally discouraged practice, as getting things wrong is easy, and interactions within the patched library / code can get really messy. In our case, I reluctantly chose monkey patching, as it was a short patch, easy to disable, and I had no other good solutions.

Now that I’ve given the standard disclaimer, let’s get to the patch! The relevant Celery function we want to patch is celery.log._setup_logger():
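In outline, the patch wraps the original function so that every logger Celery configures also picks up a logging.handlers.SysLogHandler. The sketch below is illustrative rather than the exact patch; _setup_logger is a private function whose signature and return value vary across Celery versions, so it is wrapped generically:

import logging.handlers

from celery import log as celery_log

_original_setup_logger = celery_log._setup_logger

def _setup_logger_with_syslog(*args, **kwargs):
    # Defer to Celery's own setup, then bolt on our handler. Assumes
    # _setup_logger returns the logger it configured; adjust if your
    # Celery version differs.
    logger = _original_setup_logger(*args, **kwargs)
    if not any(isinstance(h, logging.handlers.SysLogHandler)
               for h in logger.handlers):
        logger.addHandler(logging.handlers.SysLogHandler(address="/dev/log"))
    return logger

celery_log._setup_logger = _setup_logger_with_syslog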

Read more...

Rackspace Cloud Files and Servicenet

Cloud Files, Cloud Servers

At work, we use Rackspace Cloud Files for bulk blob storage of large sets of documents (in our case, patents). We use a farm of Rackspace Cloud Servers to process and serve blobs from Cloud Files. Rackspace’s offering has been very solid to date (modulo some idiosyncrasies of their particular APIs) and, quite interestingly, the underlying software is being released as an open source project, OpenStack.

One of the nice things about building a solution on Rackspace is that bandwidth between Cloud Servers and Cloud Files is free, if properly configured. However, as our recent experience bears out, there are some pitfalls and caveats in getting everything set up correctly.
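One common way to opt into the internal ServiceNet network was to prefix the Cloud Files storage hostname with “snet-”. A small helper for that rewrite, as a sketch of the convention rather than the post’s code (the example URL is made up):

from urllib.parse import urlsplit, urlunsplit

def to_servicenet(storage_url):
    """Rewrite a public Cloud Files storage URL to its ServiceNet twin by
    prefixing the hostname with "snet-" (the convention Rackspace used)."""
    parts = urlsplit(storage_url)
    if parts.netloc.startswith("snet-"):
        return storage_url
    return urlunsplit(parts._replace(netloc="snet-" + parts.netloc))

# e.g. https://storage101.example.clouddrive.com/v1/Mosso_account
#  ->  https://snet-storage101.example.clouddrive.com/v1/Mosso_account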

Read more...