Processing I/O-bound or CPU-bound tasks in the background is now common practice in most web applications. There's plenty of software to help build background jobs: some based on Redis, some on a specific language implementation with threads, some on a messaging system. And then there's Beanstalkd.
In the past few months, I had the opportunity to work on a large-scale backend project. In essence, the project takes as input XML files that vary widely in type and size. Complex validations and transformations are applied to this data, caches are prepared, the extracted data is stored and indexed, and so on. Each of these tasks is done by a well-identified worker that can be written in Ruby, PHP or Perl.
The legacy system was built on MySQL, some polling, and one big run started every single day.
One of our many missions was to make it resilient to stopping / restarting, to allow input at any given moment, and to eliminate any notion of beginning or end. But… without changing everything in either the inputs or outputs of the system. And… we get bonus points for a priority-based system.
Oh, one last detail: this system needs to be distributed across dozens of servers. And, as you may already know, distributed systems are not easy beasts.
From the beginning, we chose to invert the approach: no more polling. Workers do not ask for a job; they wait for another component to hand them one. We therefore needed a system that organizes, prioritizes and distributes jobs, hence a messaging system. We considered various options such as RabbitMQ, ØMQ and Beanstalkd.
Those options may look surprising, since one is a "messaging server", another is a "socket library", and the last one is a "work queue". And yet we had, to some extent, a proof of concept running for each of those solutions.
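As a rough illustration of that inversion (and not the system described here), the sketch below uses Python's standard queue.PriorityQueue as a stand-in for a real messaging system. The job names and priorities are made up; the point is only that the worker blocks until a job is pushed to it instead of polling a table.

```python
import queue
import threading

jobs = queue.PriorityQueue()   # stand-in for the messaging system
done = []


def worker():
    while True:
        # Blocks until a producer pushes something: no polling loop, no sleep.
        pri, name = jobs.get()
        if name is None:       # sentinel: stop the worker
            break
        done.append(name)


t = threading.Thread(target=worker)
t.start()

# Producers push work whenever it shows up; a lower number means more urgent.
jobs.put((10, "index"))
jobs.put((1, "validate"))
jobs.put((99, None))           # least urgent entry, so it is always handled last
t.join()
```

Both jobs are processed before the worker stops, in an order that depends on when the worker wakes up relative to the producer.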
We ruled out RabbitMQ because it did not provide any easy and reliable priority management.
We ruled out ØMQ because it was not "enterprise-ready" enough, and as a framework, it demands way more effort than a "server as a binary that just works".
But ØMQ, being such a beauty, will be the topic of an upcoming blog post.
So, why Beanstalkd?
Basically, we needed to distribute jobs across a closed network, to prioritize them, and consume them. Well, that's exactly what Beanstalkd provides.
Beanstalkd lets you organize jobs in tubes, each tube corresponding to a job type. Tubes are created lazily. A client can listen to several Beanstalkd tubes.
A producer can put jobs in a tube, for example as a JSON representation. put accepts various options, including one to define a priority on a job. This was a killer feature for our use case.
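To make this concrete, here is a minimal sketch of how a producer could frame such a job on the wire. In the Beanstalkd protocol, the put command has the form "put <pri> <delay> <ttr> <bytes>", followed by the job body, and a lower priority number means a more urgent job. The helper names, the tube name and the JSON fields below are hypothetical, and the default values are arbitrary; this only builds the bytes and does not open a connection.

```python
import json


def build_use(tube: str) -> bytes:
    """Frame the 'use' command, which selects the tube that later puts go to."""
    return f"use {tube}\r\n".encode("ascii")


def build_put(payload: dict, pri: int = 1024, delay: int = 0, ttr: int = 60) -> bytes:
    """Frame a 'put' command carrying a JSON body.

    pri:   lower number = more urgent (0 is the most urgent)
    delay: seconds to wait before the job becomes ready
    ttr:   seconds a consumer gets to process the job once reserved
    """
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    header = f"put {pri} {delay} {ttr} {len(body)}\r\n".encode("ascii")
    return header + body + b"\r\n"


# Hypothetical job: an urgent validation of one input file.
frames = build_use("xml.validate") + build_put(
    {"file": "catalog.xml", "step": "validate"}, pri=0
)
```

A real producer would write these bytes to a TCP socket connected to the server and read back the reply; most languages have a client library that hides this framing.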
A consumer can reserve a job: if no job is available, Beanstalkd makes the client wait until one becomes available. The client then has a limited amount of time to process the job. When the TTR (time to run) expires, Beanstalkd puts the job back in the queue. A consumer can touch a job to extend its TTR.
When a job is successfully processed, a consumer can delete the job from the tube.
In the case of failure, the consumer can bury the job. This job will not be pushed back to the tube, but will be available for further inspection.
A consumer can also release a job: Beanstalkd then pushes the job back into the tube and makes it available to another client.
release is commonly used with the delay option. This option tells Beanstalkd to push the job back into the tube, but only after a delay.
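The lifecycle described above can be summed up with a toy in-memory model of a single tube. This is only a sketch to show the state transitions: it is not a Beanstalkd client, the class and method names are made up, time is passed in explicitly, and reserve returns None where the real server would make the client wait.

```python
import heapq
import itertools


class Tube:
    """Toy model of one tube: ready, reserved, delayed and buried jobs."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.jobs = {}       # job_id -> {"body", "pri", "ttr"}
        self.ready = []      # min-heap of (pri, job_id): lowest pri first
        self.reserved = {}   # job_id -> TTR deadline
        self.delayed = {}    # job_id -> time at which the job becomes ready
        self.buried = set()

    def _make_ready(self, job_id):
        heapq.heappush(self.ready, (self.jobs[job_id]["pri"], job_id))

    def put(self, body, pri=1024, ttr=60):
        job_id = next(self._ids)
        self.jobs[job_id] = {"body": body, "pri": pri, "ttr": ttr}
        self._make_ready(job_id)
        return job_id

    def _tick(self, now):
        # Expired reservations and elapsed delays go back to the ready queue.
        for job_id, deadline in list(self.reserved.items()):
            if now >= deadline:
                del self.reserved[job_id]
                self._make_ready(job_id)
        for job_id, ready_at in list(self.delayed.items()):
            if now >= ready_at:
                del self.delayed[job_id]
                self._make_ready(job_id)

    def reserve(self, now):
        self._tick(now)
        if not self.ready:
            return None      # the real server would block the client here
        _, job_id = heapq.heappop(self.ready)
        self.reserved[job_id] = now + self.jobs[job_id]["ttr"]
        return job_id, self.jobs[job_id]["body"]

    def touch(self, job_id, now):
        # Restart the TTR countdown from now.
        self.reserved[job_id] = now + self.jobs[job_id]["ttr"]

    def delete(self, job_id):
        del self.reserved[job_id]
        del self.jobs[job_id]

    def bury(self, job_id):
        # Kept aside for inspection; never handed out again by reserve.
        del self.reserved[job_id]
        self.buried.add(job_id)

    def release(self, job_id, now, delay=0):
        del self.reserved[job_id]
        if delay > 0:
            self.delayed[job_id] = now + delay
        else:
            self._make_ready(job_id)


tube = Tube()
tube.put("reindex", pri=2000)
tube.put("validate", pri=0)

job_id, body = tube.reserve(now=0)        # "validate": the most urgent job first
tube.delete(job_id)                       # processed successfully
job_id, body = tube.reserve(now=0)        # "reindex"
tube.release(job_id, now=0, delay=30)     # try again in 30 seconds
```

In this model a released job with a delay stays invisible to reserve until the delay has elapsed, and a buried job stays in the buried set until someone looks at it.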
Beanstalkd was really the right tool for the job: full-featured, lightweight, language-agnostic, and easy to administer.
It is so easy to work with that we kind of forgot Beanstalkd is at the center of this system.
If you are interested in Beanstalkd, and you should be, the best read so far is the protocol description. It is an easy read and, by definition, an in-depth guide to Beanstalkd's features, which goes to show that Beanstalkd is a really well-designed piece of software.