Mike Nygard tells another story of his adventures trying to keep large websites up and running. This time it’s a site that went down every morning at 5 A.M.
Several years ago, I worked for a company that offered 24x7 operations for websites that we didn’t create. As crazy as it sounds, we would not only support applications and systems that our clients built but also offer uptime and performance guarantees. During that time, I got what you could call a “crash course” in operations and in what it meant to build resilient systems. These systems taught me how well our typical applications were prepared to survive the harsh rigors of production. The answer was, “Not well at all.”
In this series, I relate some incidents from that time. Names have been changed to protect the parties involved, but the essential details and interactions are all accurate. You may find these vignettes entertaining or enlightening, or you may shake your head at how unprepared we all were. As you read, please bear Kerth’s Prime Directive in mind: “Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.” Also, it’s no fun to tell stories about times when everything went right!
One of the sites I launched developed this very nasty pattern of hanging completely at almost exactly 5 A.M. every day. The site was running on around thirty different instances, so something was happening to make all thirty different application server instances hang within a five-minute window (the resolution of our URL pinger). Restarting the application servers always cleared it up, so there was some transient effect that tipped the site over at that time. Unfortunately, that was just when traffic started to ramp up for the day. From midnight to 5 A.M., there were only about 100 transactions per hour of interest, but the numbers ramped up quickly once the East Coast started to come online (one hour ahead of us Central Time folks). Restarting all the application servers just as people started to hit the site in earnest was what you’d call a suboptimal approach.
On the third day this occurred, I took thread dumps from one of the afflicted application servers. The instance was up and running, but all request-handling threads were blocked inside the Oracle JDBC library, specifically inside of OCI calls. (We were using the thick-client driver for its superior failover features.) In fact, once I eliminated the threads that were just blocked trying to enter a synchronized method, it looked as if the active threads were all in low-level socket read or write calls.
The next step was tcpdump and ethereal. (Ethereal has since been renamed Wireshark.) The odd thing was how little that showed. A handful of packets were being sent from the application servers to the database servers, but with no replies. Also, nothing was coming from the database to the application servers. Yet monitoring showed that the database was alive and healthy. There were no blocking locks, the run queue was at zero, and the I/O rates were trivial.
By this time, we had to restart the application servers. Our first priority is restoring service. We do data collection when we can, but not at the risk of breaking an SLA. (Service-level agreement: a contractual obligation to provide a service to a measurable, quantitative level. Financial penalties accompany the violation of an SLA.) Any deeper investigation would have to wait until it happened again. None of us doubted that it would happen again.
Sure enough, the pattern repeated itself the next morning. Application servers locked up tight as a drum, with the threads inside the JDBC driver. This time, I was able to look at traffic on the databases’ network. Zilch. Nothing at all. The utter absence of traffic on that side of the firewall was like Sherlock Holmes’ dog that didn’t bark in the night—the absence of activity was the biggest clue. I had a hypothesis. Quick decompilation of the application server’s resource pool class confirmed that my hypothesis was plausible.
Socket connections are an abstraction. They exist only as objects in the memory of the computers at the endpoints. Once established, a TCP connection can exist for days without a single packet being sent by either side. (Assuming you set suitably perverse timeouts in the kernel.) As long as both computers have that socket state in memory, the “connection” is still valid. Routes can change, and physical links can be severed and reconnected. It doesn’t matter; the “connection” persists as long as the two computers at the endpoints think it does.
There was a time when that all worked beautifully well. These days, a bunch of paranoid little bastions have broken the philosophy and implementation of the whole Net. I’m talking about firewalls, of course.
A firewall is nothing but a specialized router. It routes packets from one set of physical ports to another. Inside each firewall, a set of access control lists define the rules about which connections it will allow. The rules say such things as “connections originating from 192.0.2.0/24 to 192.168.1.199 port 80 are allowed.” A TCP connection attempt starts when one host sends a SYN packet to another. When there’s a firewall in the way, it gets to make a decision. When the firewall sees an incoming SYN packet, it checks it against its rule base. The packet might be allowed (routed to the destination network), rejected (TCP reset packet sent back to origin), or ignored (dropped on the floor with no response at all). If the connection is allowed, then the firewall makes an entry in its own internal table that says something like “192.0.2.98:32770 is connected to 192.168.1.199:80.” Then all future packets, in either direction, that match the endpoints of the connection are routed between the firewall’s networks.
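To make the rule check concrete, here is a minimal Python sketch of how a firewall might decide what to do with an incoming SYN. The `Rule` class, the `decide` function, and the default-drop policy are all assumptions made for illustration; real firewalls have far richer rule languages.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    source: str   # CIDR block the connection must originate from
    dest: str     # destination IP address
    port: int     # destination port
    action: str   # "allow", "reject" (send a TCP reset), or "drop" (stay silent)


def decide(rules, src_ip, dst_ip, dst_port, default="drop"):
    """Return the action of the first rule matching an incoming SYN.

    The default-drop policy is an assumption of this sketch.
    """
    src = ipaddress.ip_address(src_ip)
    for rule in rules:
        if (src in ipaddress.ip_network(rule.source)
                and dst_ip == rule.dest
                and dst_port == rule.port):
            return rule.action
    return default


# The example rule from the text.
RULES = [Rule("192.0.2.0/24", "192.168.1.199", 80, "allow")]

print(decide(RULES, "192.0.2.98", "192.168.1.199", 80))    # allow
print(decide(RULES, "198.51.100.7", "192.168.1.199", 80))  # drop
```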
So far, so good. How is this related to my 5 A.M. wake-up calls?
The key is that table of established connections inside the firewall. It’s finite. Therefore, it does not allow infinite duration connections, even though TCP itself does allow them. Along with the endpoints of the connection, the firewall also keeps a “last packet” time. If too much time elapses without a packet on a connection, the firewall assumes that the endpoints are dead or gone. It just drops the connection from its table, as shown in the figure. But TCP was never designed for that kind of intelligent device in the middle of a connection. There’s no way for a third party to tell the endpoints that their connection is being torn down. The endpoints assume their connection is valid for an indefinite length of time, even if no packets are crossing the wire.
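The bookkeeping described above can be sketched in a few lines of Python. This is a toy model, not any vendor's implementation: the `ConnectionTable` class, its method names, and the 3,600-second timeout used in the example are assumptions chosen for illustration.

```python
class ConnectionTable:
    """Toy model of a firewall's table of established connections."""

    def __init__(self, idle_timeout_s):
        self.idle_timeout_s = idle_timeout_s
        self._last_packet = {}  # connection -> time the last packet was seen

    def open(self, conn, now):
        """Record a newly allowed connection."""
        self._last_packet[conn] = now

    def expire_idle(self, now):
        """Forget connections that have been quiet for too long.

        Nothing is sent to either endpoint; the entry simply disappears.
        """
        stale = [c for c, t in self._last_packet.items()
                 if now - t > self.idle_timeout_s]
        for conn in stale:
            del self._last_packet[conn]

    def forward(self, conn, now):
        """Return True if the packet is routed, False if silently dropped."""
        self.expire_idle(now)
        if conn not in self._last_packet:
            return False
        self._last_packet[conn] = now  # reset the "last packet" time
        return True


table = ConnectionTable(idle_timeout_s=3600)
conn = ("192.0.2.98:32770", "192.168.1.199:80")
table.open(conn, now=0)
print(table.forward(conn, now=1800))         # True: still within the timeout
print(table.forward(conn, now=1800 + 3601))  # False: the entry was dropped
```

Note that neither endpoint is told anything when the entry disappears, which is exactly why both ends keep believing in the connection.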
After that point, any attempt to read or write from the socket on either end does not result in a TCP reset or an error due to a half-open socket. Instead, the TCP/IP stack sends the packet, waits for an ACK, doesn’t get one, and retransmits. The faithful stack tries and tries to reestablish contact, and that firewall just keeps dropping the packets on the floor, without so much as an “ICMP destination unreachable” message. (That could let bad guys probe for active connections by spoofing source addresses.) A Linux box, running on a 2.6 series kernel, has its tcp_retries2 set to the default value of 15, which results in a twenty-minute timeout before the TCP/IP stack informs the socket library that the connection is broken. The HP-UX servers we were using at the time had a thirty-minute timeout. That application’s one-line call to write to a socket could block for thirty minutes! The situation for reading from the socket is even worse. It could block forever.
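To see how fifteen retries can add up to many minutes, here is a simplified model of exponential backoff in Python. It is a sketch of the arithmetic, not the kernel's actual logic: it assumes the retransmission timeout doubles after every attempt and is capped at 120 seconds (the cap Linux uses), and the initial timeout values are assumptions, since the real one depends on the measured round-trip time.

```python
def total_retransmit_timeout(initial_rto_s, retries=15, max_rto_s=120.0):
    """Sum the waits for the original send plus `retries` retransmissions.

    Simplified model: the timeout doubles after each attempt and is capped
    at max_rto_s. Real stacks derive the initial value from the measured RTT.
    """
    total = 0.0
    rto = initial_rto_s
    for _ in range(retries + 1):
        total += min(rto, max_rto_s)
        rto *= 2
    return total


# Assumed initial timeouts of 200 ms and 1 s.
print(round(total_retransmit_timeout(0.2), 1))  # 924.6 seconds, about 15 minutes
print(round(total_retransmit_timeout(1.0), 1))  # 1207.0 seconds, about 20 minutes
```

Under this model, the total depends heavily on the starting timeout, which is why quoted figures for the same tcp_retries2 setting vary.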
So the stack trace claimed that everyone was busy with database calls, but the network trace showed that no database work was happening. What next? Time to decompile something.
When I decompiled the resource pool class, I saw that it used a last-in, first-out strategy. During the slow overnight times, traffic volume was light enough that one single database connection would get checked out of the pool, used, and checked back in. Then the next request would get the same connection, leaving the thirty-nine others to sit idle until traffic started to ramp up. They were idle well over the one-hour idle connection timeout configured into the firewall.
Once traffic started to ramp up, those thirty-nine connections per application server would get locked up immediately. Even if the one connection was still being used to serve pages, sooner or later it would be checked out by a thread that ended up blocked on a connection from one of the other pools. Then the one good connection would be held by a blocked thread. Total site hang.
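A small simulation shows why a last-in, first-out pool leaves almost every connection idle overnight. The `LifoPool` class is a hypothetical stand-in, not the application server's actual pool. The sketch assumes forty connections per pool (one busy plus the thirty-nine idle ones), strictly sequential requests at about 100 per hour, and the one-hour firewall timeout.

```python
class LifoPool:
    """Hypothetical last-in, first-out connection pool."""

    def __init__(self, size):
        self._idle = list(range(size))            # end of list is top of stack
        self.last_used = {c: 0 for c in self._idle}

    def checkout(self):
        return self._idle.pop()                   # most recently returned first

    def checkin(self, conn, now):
        self.last_used[conn] = now
        self._idle.append(conn)


def simulate_quiet_night(pool, request_times):
    """Serve one request at a time; each checks out and returns a connection."""
    for t in request_times:
        conn = pool.checkout()
        pool.checkin(conn, now=t)


FIREWALL_IDLE_TIMEOUT_S = 3600                    # one hour, as in the text

pool = LifoPool(size=40)
# Midnight to 5 A.M. at roughly 100 requests per hour: one every 36 seconds.
simulate_quiet_night(pool, request_times=range(0, 5 * 3600, 36))

now = 5 * 3600
stale = [c for c, t in pool.last_used.items()
         if now - t > FIREWALL_IDLE_TIMEOUT_S]
print(len(stale))  # 39
```

A first-in, first-out pool would have rotated through all forty connections under the same load, so each would have carried a packet about every twenty-four minutes.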
Figure: Idle Connection Dropped by Firewall
Once we understood all the links in that chain of failure, we had to find a solution. The resource pool has the ability to test JDBC connections for validity before checking them out. It checked validity by executing a SQL query like SELECT SYSDATE FROM DUAL. Well, that would just make the request-handling thread hang anyway. We could also have the pool keep track of the idle time of the JDBC connection and discard any that were older than one hour. Unfortunately, that involves sending a packet to the database server to tell it that the session is being torn down. Hang.
We were starting to look at some really hairy complexities, such as creating a “reaper” thread to find connections that were close to getting too old and tearing them down before they timed out. Fortunately, a sharp DBA recalled just the thing. Oracle has a feature called dead connection detection that you can enable to discover when clients have crashed. When enabled, the database server sends a ping packet to the client at some periodic interval. If the client responds, then the database knows it is still alive. If the client fails to respond after a few retries, the database server assumes the client has crashed and frees up all the resources held by that connection.
We weren’t that worried about the client crashing, but the ping packet itself would be enough to reset the firewall’s “last packet” time for the connection, keeping the connection alive. Dead connection detection kept the connection alive, which let me sleep through the night.
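The reasoning fits in a tiny Python check. The `survives_idle_period` function is hypothetical, and the ten-minute ping interval is an assumed value, not the one we actually used. (In Oracle Net, dead connection detection is normally enabled through the SQLNET.EXPIRE_TIME setting in the server's sqlnet.ora, expressed in minutes.)

```python
def survives_idle_period(idle_period_s, firewall_timeout_s, ping_interval_s=None):
    """True if the firewall still has the entry after an idle stretch.

    With no pings, the longest silent gap is the whole idle period. With
    pings, it is at most one ping interval.
    """
    if ping_interval_s is None:
        longest_gap = idle_period_s
    else:
        longest_gap = min(ping_interval_s, idle_period_s)
    return longest_gap <= firewall_timeout_s


FIVE_HOURS = 5 * 3600
ONE_HOUR = 3600

print(survives_idle_period(FIVE_HOURS, ONE_HOUR))                        # False
print(survives_idle_period(FIVE_HOURS, ONE_HOUR, ping_interval_s=600))   # True
print(survives_idle_period(FIVE_HOURS, ONE_HOUR, ping_interval_s=7200))  # False
```

The only requirement is that the ping interval be shorter than the firewall's idle timeout; a ping that arrives every two hours would not have saved us.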
Michael is a veteran software developer and architect. His background is either “well-rounded” or “checkered” depending on how charitable you'd like to be. He has worked in Operations (including for a “Top Ten” internet retailer), sales engineering, and as a technology manager and executive.
In 2007, Michael wrote Release It! to bring awareness of operational concerns to the software development community. This early influence in the DevOps movement showed developers how to write systems that survive the real world after QA.