No one designed this failure mode in, but no one designed it out either.
Several years ago, I worked for a company that offered 24x7 operations for websites that we didn’t create. As crazy as it sounds, we would not only support applications and systems that our clients built but also offer uptime and performance guarantees. During that time, I got what you could call a “crash course” in operations and in what it meant to build resilient systems. Those years taught me how well our typical applications were prepared to survive the harsh rigors of production. The answer was, “Not well at all.”
In this series, I relate some incidents from that time. Names have been changed to protect the parties involved, but the essential details and interactions are all accurate. You may find these vignettes entertaining or enlightening, or you may shake your head at how unprepared we all were. As you read, please bear Kerth’s Prime Directive in mind: “Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.” Also, it’s no fun to tell stories about times when everything went right!
Spot the Blocking Call
Can you find the blocking call in the following code?
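The original listing is not reproduced here, so the following is only a sketch of the shape of the call site in question; every identifier (CallSite, availabilityBySku, GlobalObjectCacheSketch, and so on) is an assumption, not the original code. The point it illustrates stands either way: the two lookups read identically, yet only one of them can block.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative reconstruction only; names are assumptions, not the original code.
class CallSite {
    static final Map<String, String> availabilityBySku = new HashMap<>();
    static final GlobalObjectCacheSketch globalObjectCache = new GlobalObjectCacheSketch();

    static String check(String sku) {
        String local = availabilityBySku.get(sku);   // plain map lookup: never blocks
        if (local != null) {
            return local;
        }
        return globalObjectCache.get(sku);           // same shape of call, but can block
    }
}

class GlobalObjectCacheSketch {
    private final Map<String, String> items = new HashMap<>();

    // The hidden synchronization: nothing at the call site above reveals it.
    public synchronized String get(String id) {
        return items.computeIfAbsent(id, k -> "UNKNOWN");
    }
}
```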
You might suspect that globalObjectCache is a likely place to find some synchronization. You would be correct, but the point is that nothing in the calling code tells you that one of these calls is blocking and the other is not. In fact, the interface that globalObjectCache implemented didn’t say anything about synchronization either.
In Java, it is possible for a subclass to declare a method synchronized that is unsynchronized in its superclass or interface definition. Object-oriented purists will tell you that this violates the Liskov Substitution Principle. They are correct. (For more on the Liskov Substitution Principle, see the end of this article.)
You cannot transparently replace an instance of the superclass with the synchronized subclass. This might seem like nit-picking, but it can be vitally important.
The basic implementation of the GlobalObjectCache interface is a relatively straightforward object registry:
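A minimal sketch of such a registry, consistent with the description in the text; the field names and the protected create() hook are assumptions, not the original source.

```java
import java.util.HashMap;
import java.util.Map;

// A read-through object registry: look up by id, create on a miss.
class GlobalObjectCache {
    private final Map<String, Object> items = new HashMap<>();

    // While one thread executes get(), every other caller blocks at the monitor.
    public synchronized Object get(String id) {
        Object obj = items.get(id);
        if (obj == null) {
            obj = create(id);      // subclasses override this factory method
            items.put(id, obj);
        }
        return obj;
    }

    // Cheap in the base class; a subclass can make it arbitrarily expensive.
    protected Object create(String id) {
        return new Object();
    }
}
```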
You should hear mental alarm bells when you see the synchronized keyword on a method. While one thread is executing this method, any other callers of the method will be blocked. In this case, synchronizing the method is the right thing to do.
(Some of you Java programmers might have seen an idiom called the double-checked lock that is meant to avoid synchronizing the whole method. Unfortunately, it just doesn’t work. See this article for a complete rundown on why it doesn’t work and why all the attempts to fix the pattern also don’t work.)
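For readers who have not seen it, here is the idiom in question, reproduced only to show what the warning is about; the class names are illustrative. It compiles and usually appears to work, but without declaring the field volatile, the Java memory model permits another thread to observe a non-null reference to a not-yet-fully-initialized object.

```java
// The classic double-checked locking idiom. Do not use this in real code.
class LazyHolder {
    private Helper helper;   // not volatile: this is the flaw

    public Helper getHelper() {
        if (helper == null) {                 // first check, no lock
            synchronized (this) {
                if (helper == null) {         // second check, under the lock
                    helper = new Helper(42);  // write may be reordered with construction
                }
            }
        }
        return helper;
    }
}

class Helper {
    int value;
    Helper(int value) { this.value = value; }
}
```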
This method executes quickly, and even if there is some contention between threads trying to get into the method, they should all be served fairly promptly. (A word of caution, however. GlobalObjectCache could easily become a capacity constraint if every transaction uses it heavily. See Antipattern 9.1, Resource Pool Contention, on page 176 of my book Release It!: Design and Deploy Production-Ready Software for an example of the effect that blocked requests have on throughput.)
Part of the system had to check the in-store availability of items by making expensive inventory-availability queries to a remote system. These external calls took a few seconds to execute. The results were known to be valid for at least fifteen minutes because of the way the inventory system worked. Since nearly 25% of the inventory lookups were on the week’s “hot items” and there could be as many as 4,000 (worst case) concurrent requests against the undersized, overworked inventory system, the developer decided to cache the resulting Availability object.
The developer decided that the right metaphor was a read-through cache. On a hit, it would return the cached object. On a miss, it would do the query, cache the result, and then return it. Following good object-oriented principles, the developer decided to create an extension of GlobalObjectCache, overriding the create() method to make the remote call. It was a textbook design. The new RemoteAvailabilityCache was a cache proxy, as described on pages 111–112 of Pattern Languages of Program Design 2. It even had a time stamp on the cached entries so they could be expired when the data became too stale. This was an elegant design, but it wasn’t enough.
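A sketch of how such a cache proxy might look. The base registry is restated minimally here, and the class structure, field names, and the placement of the expiry check are assumptions drawn from the description; only the fifteen-minute validity window comes from the text.

```java
import java.util.HashMap;
import java.util.Map;

// Base read-through registry with an overridable factory method.
class ObjectCacheBase {
    protected final Map<String, TimestampedEntry> items = new HashMap<>();

    public synchronized Object get(String id) {
        TimestampedEntry entry = items.get(id);
        if (entry == null || entry.isStale()) {
            entry = new TimestampedEntry(create(id));
            items.put(id, entry);
        }
        return entry.value;
    }

    protected Object create(String id) { return new Object(); }
}

// Cached entries carry a time stamp so stale data can be expired.
class TimestampedEntry {
    final Object value;
    final long createdAt = System.currentTimeMillis();
    static final long MAX_AGE_MS = 15 * 60 * 1000;  // results valid for ~15 minutes

    TimestampedEntry(Object value) { this.value = value; }
    boolean isStale() { return System.currentTimeMillis() - createdAt > MAX_AGE_MS; }
}

// The cache proxy: a miss triggers the expensive remote query.
class RemoteAvailabilityCache extends ObjectCacheBase {
    @Override
    protected Object create(String sku) {
        // Stand-in for the remote inventory query; in the real system this
        // took seconds and had no timeout, which is the heart of the story.
        return queryRemoteInventory(sku);
    }

    private Object queryRemoteInventory(String sku) {
        return "AVAILABLE:" + sku;  // fake result for the sketch
    }
}
```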
The problem with this design had nothing to do with the functional behavior. Functionally, RemoteAvailabilityCache was a nice piece of work. In times of stress, however, it had a nasty failure mode. The inventory system was undersized (see Antipattern 4.8, Unbalanced Capacities, on page 96 of Release It!), so when the front end got busy, the back end would be flooded with requests. Eventually, it crashed. At that point, any thread calling RemoteAvailabilityCache.get() would block, because one single thread was inside the create() call, waiting for a response that would never come. There they sat, Estragon and Vladimir, waiting endlessly for Godot.
This example shows how these antipatterns interact perniciously to accelerate the growth of cracks. The conditions for failure were created by the blocking threads and unbalanced capacities. The lack of timeouts in the integration points caused the failure in one layer to become a cascading failure. Ultimately, this combination of forces brought down the entire site.
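Since the diagnosis points at the missing timeouts, here is one hedged way such an integration point could be bounded, using java.util.concurrent; all names and the two-second limit are illustrative choices, not a prescription from the original system.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Bound the remote call with a timeout and degrade instead of waiting forever,
// so a dead back end cannot hang every front-end thread.
class TimedAvailabilityLookup {
    private final ExecutorService pool = Executors.newFixedThreadPool(4, r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);  // don't keep the JVM alive for stuck lookups
        return t;
    });

    public String check(String sku) {
        Future<String> result = pool.submit(() -> queryRemoteInventory(sku));
        try {
            return result.get(2, TimeUnit.SECONDS);  // give up instead of blocking forever
        } catch (TimeoutException e) {
            result.cancel(true);
            return "UNKNOWN";  // degraded but alive: the site keeps serving pages
        } catch (Exception e) {
            return "UNKNOWN";
        }
    }

    // Stand-in for the slow remote inventory query.
    private String queryRemoteInventory(String sku) {
        return "AVAILABLE:" + sku;
    }
}
```

The design choice here is to answer “unknown” rather than wait: callers get a fast, honest degradation, and blocked threads never pile up behind a dead integration point.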
Obviously, the business sponsors would laugh if you asked them, “Should the site crash if it can’t check availability for in-store pickup?”
If you asked the architects or developers, “Will the site crash if it can’t check availability?” they would assert that it would not. Even the developer of RemoteAvailabilityCache would not expect the site to hang if the inventory system stopped responding. No one designed this failure mode into the combined system, but no one designed it out either.
Synchronization and the Liskov Substitution Principle
I mentioned the Liskov Substitution Principle, so maybe I should explain what that is.
In object theory, the Liskov Substitution Principle (see “Family Values: A Behavioral Notion of Subtyping”) states that any property that is true of objects of a type T should also be true of objects of any subtype of T. In other words, a method that is free of side effects in a base class should also be free of side effects in derived classes. A method that throws the exception E in the base class should throw only exceptions of type E (or subtypes of E) in derived classes.
Java violates that principle by allowing a subclass to declare a method synchronized that is unsynchronized in its superclass or interface definition. Java does not allow other declared violations of the substitution principle. It is not clear whether the ability to add synchronization in a subclass was a deliberate weakening of the Liskov Substitution Principle or whether it was just an oversight.
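A small example of the asymmetry, with illustrative identifiers: javac would reject an override that added a broader checked exception, yet it accepts one that silently adds locking.

```java
import java.util.HashMap;
import java.util.Map;

// The interface's contract says nothing about locking.
interface Registry {
    Object get(String id);
}

// The implementation acquires the monitor on every call; javac accepts this
// without comment, and callers working through the interface cannot tell.
class SynchronizedRegistry implements Registry {
    private final Map<String, Object> items = new HashMap<>();

    @Override
    public synchronized Object get(String id) {  // stronger (blocking) behavior than declared
        return items.computeIfAbsent(id, k -> new Object());
    }
}
```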
Michael is a veteran software developer and architect. His background is either “well-rounded” or “checkered” depending on how charitable you'd like to be. He has worked in Operations (including for a “Top Ten” internet retailer), sales engineering, and as a technology manager and executive.
In 2007, Michael wrote Release It! to bring awareness of operational concerns to the software development community. This early influence in the DevOps movement showed developers how to write systems that survive the real world after QA.