A Design Challenge
Background
We use social DRM on our eBooks—each book is customized with the purchaser’s name and other order information before being sent to them.
The systems that perform this customization are called gerbils. They run as EC2 instances.
Each gerbil instance has a local copy of the masters of each of the eBooks. When we update a book, the gerbils must refresh their local copies to pick up the latest version.
The gerbils send periodic requests for work to our admin server. The request includes version information on their local eBook cache. If the server determines the cache is out of date, its response tells the gerbil to refresh the cache. Otherwise it sends down the details of the next order to be fulfilled.
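The exchange described above can be sketched as a single server-side decision. This is only an illustration, not the actual admin server: the message types, the field names, the use of an integer for the cache version, and the NoWork reply for an empty queue are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class WorkRequest:
    gerbil_id: str       # hypothetical identifier for the requesting gerbil
    cache_version: int   # hypothetical: version of the gerbil's local eBook cache


@dataclass
class RefreshCache:
    target_version: int  # version the gerbil should update its cache to


@dataclass
class FulfilOrder:
    order_id: str
    book: str
    purchaser: str


@dataclass
class NoWork:
    """Assumed reply when the queue is empty; the original does not describe this case."""


Reply = Union[RefreshCache, FulfilOrder, NoWork]


def handle_work_request(req: WorkRequest,
                        current_version: int,
                        pending_orders: List[FulfilOrder]) -> Reply:
    # Any version mismatch is treated as a stale cache: refresh before doing work.
    if req.cache_version != current_version:
        return RefreshCache(target_version=current_version)
    if not pending_orders:
        return NoWork()
    # Cache is current, so hand out the next order to be fulfilled.
    return pending_orders.pop(0)
```

Because the request carries the cache version every time, the server in this sketch keeps no per-gerbil state: a gerbil that restarts simply reports whatever version it has on its next request.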
We currently store the eBook masters in a private Subversion repository (we use svn for various historical reasons). The gerbils run svn up against this repository when they first start and whenever the server tells them the cache is out of date.
The Problem
During periods of heavy load, we need to bring new gerbils online (sometimes up to 20 instances).
As each instance starts, it boots its AMI. That image includes a snapshot of the checked-out eBook repository as of the date the image was created. Each gerbil, as it starts, updates this cache to get the latest version. This involves downloading perhaps 500 MB of data per instance, as new books will have been added and existing books updated since the image was frozen. As a result, it will often take 30–45 minutes to start 10 instances.
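To put those figures in perspective, here is a rough back-of-envelope calculation. It assumes the 500 figure means megabytes, that all 10 instances update at the same time, and that the update dominates startup time; none of this is measured, and the function name is made up for the example.

```python
def aggregate_throughput_mb_per_s(instances: int,
                                  mb_per_instance: float,
                                  minutes: float) -> float:
    """Total data pulled by all instances, divided by the elapsed time in seconds."""
    return instances * mb_per_instance / (minutes * 60)


# 10 instances x 500 MB = 5000 MB in total.
best_case = aggregate_throughput_mb_per_s(10, 500, 30)   # 5000 MB over 1800 s
worst_case = aggregate_throughput_mb_per_s(10, 500, 45)  # 5000 MB over 2700 s

print(f"{worst_case:.2f} to {best_case:.2f} MB/s across all instances")
```

Under these assumptions the repository delivers roughly 1.9 to 2.8 MB/s in total across all 10 gerbils.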
Your Challenge
Come up with an architecture that will dramatically shorten the time it takes to start an instance. The constraints are:
- you cannot assume there’s a master gerbil instance
- the solution must be robust: we can drop and restart instances at will, and gerbils must handle arbitrary restarts of the server
- you’re free to move away from SVN

