Moving logging from the province of the developer’s discretion to the circle of architecture and design results in just-in-time logging.
Logging is the poor stepchild of many applications, even though for many enterprise applications it is the primary means of diagnosing problems. You cannot, after all, attach a debugger to Amazon.com.
Many existing products simultaneously suffer from excessive logging and insufficient information in those logs. This exacts an enormous performance penalty and makes diagnosing systems in the field problematic. We often write so many hundreds of thousands of log lines that most support personnel have developed libraries of grep scripts to throw away 99% of the text. In addition, so much is written to the logs that they roll over too quickly, and information about the original cause of a problem is often lost. Over the years many teams have repeatedly attempted to trim the amount of logging, with limited or temporary success.
Insanity is defined as doing the same thing and expecting different results. It is time to take a new approach to logging. The new approach proposed here adds formality and rigor to the topic of logging, while at the same time improving performance and ease of post-hoc analysis.
Details of the Problem
While we have at times enforced a discipline of design with regard to large matters such as architecture, the field tends to treat logging as something left to the discretion of the individual developer. We have quite appropriately wanted to leave developers with some freedom of action in which to exercise their craft. This approach, while admirable, has nonetheless had its drawbacks.
Few teams exhibit discipline about which messages should be marked debug and which should be marked info. Fewer still exhibit discipline about putting the relevant noun in the log message, where the noun is the ID of the thing being described by the message. This has meant that, in order to have access to sufficient information, one very often needs to log at the most verbose level, i.e., debug. This in turn has resulted in our systems logging literally millions or tens of millions of lines per hour, which has had an enormous negative effect on overall performance, to say nothing of the increased difficulty of managing the system.
Previously Attempted Solutions That Have Failed
Over the last ten years of working with various teams I’ve seen any number of bug scrubs or decrees issued. “All log messages will contain nouns,” “all debug messages shall be protected by if(log.isDebugEnabled) tests,” and “the number of log messages shall be strictly limited.” Each of these endeavors resulted in a significant and brief improvement, which was quickly overcome by the entropy of unstructured coding.
Suggesting that we perform another such log scrub flies in the face of reality. Another approach is required.
The Proposed Solution
The proposed solution moves logging from the province of the developer’s discretion to the circle of architecture and design. It is to be something that is designed and prescribed. The proposed solution is similar to a logging scheme that was successfully implemented for stored procedures. The approach recognizes that different levels of information should be logged for successful versus unsuccessful operations. In a perfect world any failure could be fully analyzed from the single log message written at the point of detecting the failure. This is not usually the case, however. If a video stream create fails because the selected streamer has no remaining resources, it is interesting to know how the streamer was selected. Successful operations by definition have fully completed, while failed operations by definition have not.
The problem is that until the operation completes one cannot tell whether it will be successful or not. The stored procedure logging solution was to write log messages into an in-memory temp table in the database, an operation that is close to free in terms of performance. If the stored procedure succeeds, the temp table is dropped. If the stored procedure fails, it performs a bulk insert from the temp table into a normal database log table.
The log4j approach is to buffer all log messages in a HashMap indexed by the relevant noun. This is a very fast operation, so one can write as many log messages as one wants. When the operation completes, the system does one of two things. If the operation failed, the buffered messages are stringified and written in a single step, incurring only one write. If the operation succeeded, we assume that the buffer contains far more information than we need, so we write only a summary of it: in the success case a single line of output goes to the log. Since we assume that in a normally running system at least 95% of operations will succeed, this approach ought to reduce logging volume by close to 90%. That will have a dramatic effect on how long log files live, and would be a game changer in terms of the debuggability of systems both in our labs and in the field. This alone might be reason enough to make the change.
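A quick back-of-the-envelope calculation shows where the 90% figure comes from. The per-operation line count below is an illustrative assumption, not a measurement:

```java
public class LogVolumeEstimate {
    // Assumed average log lines per operation under the old approach.
    static final double LINES_PER_OP = 20.0;

    // Buffered approach: one summary line on success,
    // the full buffer (the same ~20 lines) on failure.
    static double bufferedLinesPerOp(double successRate) {
        return successRate * 1.0 + (1.0 - successRate) * LINES_PER_OP;
    }

    // Fractional reduction in log volume versus the old approach.
    static double reduction(double successRate) {
        return 1.0 - bufferedLinesPerOp(successRate) / LINES_PER_OP;
    }

    public static void main(String[] args) {
        // At a 95% success rate: 0.95*1 + 0.05*20 = 1.95 lines per
        // operation instead of 20, a reduction of just over 90%.
        System.out.printf("reduction at 95%% success: %.1f%%%n",
                reduction(0.95) * 100);
    }
}
```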
There are several parts to the suggested implementation. The first part is a new BufferedLogger class that extends the standard log4j Logger class. This class contains a static ConcurrentHashMap&lt;Long, StringBuffer&gt; to hold the various log messages. It also contains overloaded versions of the standard info, debug, and warn methods. These overloaded versions take two parameters: the first being the noun ID and the second being the message. Instead of writing a log message like this:

log.info(message);

you will instead write:

log.info(nounID, message);
The result of this call is that your new message gets appended to the string buffer of messages associated with that particular noun instead of being written directly to the log. The BufferedLogger class supports a flush method that is called when the client wants to output the accumulated string buffer. The flush method contains a normal log4j logger, which writes the string buffer to the actual server.log file.
There is also a remove method that is called when an operation succeeds. This method takes a nounID and a string message. It removes all entries for the nounID from the hash map and logs the supplied success message to the real log4j log file. The other major part of the implementation is of course the changes to client code to use the new approach. Please note that only high volume operations need to be converted to this approach. To use this style of logging a client class would create a static BufferedLogger just as we normally create a standard logger at the top of each class. One could even create both a standard logger and a BufferedLogger within the same class.
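A minimal sketch of the class described above might look like the following. To keep the example self-contained it prints to stdout and returns what it wrote rather than extending log4j's Logger; the class and method names follow the text, everything else is illustrative:

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the BufferedLogger described above. A real implementation
// would extend org.apache.log4j.Logger and write through it in flush()
// and remove(); this version prints to stdout so it stands alone.
public class BufferedLogger {

    private static final ConcurrentHashMap<Long, StringBuffer> BUFFERS =
            new ConcurrentHashMap<>();

    // Overloaded logging methods: append to the buffer for this noun
    // instead of writing directly to the log file.
    public void info(long nounId, String message)  { append(nounId, "INFO  " + message); }
    public void debug(long nounId, String message) { append(nounId, "DEBUG " + message); }
    public void warn(long nounId, String message)  { append(nounId, "WARN  " + message); }

    private void append(long nounId, String line) {
        BUFFERS.computeIfAbsent(nounId, id -> new StringBuffer())
               .append(line).append('\n');
    }

    // Failure path: emit the whole accumulated buffer in one write.
    // Returns what was written, or null if nothing was buffered.
    public String flush(long nounId) {
        StringBuffer buf = BUFFERS.remove(nounId);
        if (buf == null) {
            return null;
        }
        System.out.print(buf);              // real impl: one log4j call
        return buf.toString();
    }

    // Success path: discard the detail, emit only a one-line summary.
    public String remove(long nounId, String successMessage) {
        BUFFERS.remove(nounId);
        System.out.println(successMessage); // real impl: one log4j call
        return successMessage;
    }
}
```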
For those log messages associated with a given verb, each log.info(message) call would be changed to log.info(msid, message). This means that every log message involved in the processing of a particular verb must include the nounID, but that is, after all, a goal we have been striving toward unsuccessfully for years. This approach simply mandates it by making the nounID part of the method signature.
The verb operation must be surrounded by the appropriate try, catch, and finally blocks so that the code can determine whether the operation succeeded and the buffered logger can call flush or remove accordingly. Again, that structure should already be in place.
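That control flow has the following shape. The createStream verb, the failure mode, and the embedded stub logger here are all hypothetical stand-ins so the sketch compiles on its own:

```java
public class StreamCreator {
    // Minimal stand-in for the BufferedLogger described in the text.
    static class BufferedLogger {
        private final StringBuilder buf = new StringBuilder();
        void info(long nounId, String m) { buf.append(m).append('\n'); }
        // Failure: dump everything accumulated, then clear.
        void flush(long nounId) { System.out.print(buf); buf.setLength(0); }
        // Success: discard the detail, log a one-line summary.
        void remove(long nounId, String m) { buf.setLength(0); System.out.println(m); }
    }

    private static final BufferedLogger log = new BufferedLogger();

    public static boolean createStream(long msid, boolean failHalfway) {
        boolean succeeded = false;
        try {
            log.info(msid, "selecting streamer for " + msid);
            if (failHalfway) {
                throw new IllegalStateException("no streamer resources");
            }
            log.info(msid, "streamer selected, allocating session");
            succeeded = true;
            return true;
        } catch (RuntimeException e) {
            log.info(msid, "create failed: " + e.getMessage());
            return false;
        } finally {
            // The finally block decides which logging path to take.
            if (succeeded) {
                log.remove(msid, "stream " + msid + " created OK");
            } else {
                log.flush(msid);
            }
        }
    }
}
```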
Enforcing the New Approach
There is actually a very simple way to ensure that this new approach is used. Create a unit test that calls the code for the verb in question, and in that test override Logger.createInstance (which is how code obtains the old-style log4j logger instance) so that it returns a mock logger that calls assertTrue(false) if it is ever invoked. If any code calls the old-style logging during stream create, the unit test fails. In this way we do not rely on developers to follow the rules; we make rule enforcement automatic.
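A sketch of that enforcement follows. Logger, LoggerFactory, and createInstance are simplified stand-ins for the real classes, and a throwing mock replaces the JUnit assertTrue(false) so the example needs no test framework:

```java
public class LoggingEnforcementTest {
    // Simplified stand-in for the old-style logger interface.
    interface Logger {
        void info(String message);
    }

    // Stand-in for the factory production code uses to obtain loggers.
    static class LoggerFactory {
        static Logger instance = message -> { /* normal logging */ };
        static Logger createInstance() { return instance; }
    }

    // Production-style code under test: it uses whatever logger the
    // factory hands out.
    static void streamCreateVerb(boolean usesOldLogging) {
        if (usesOldLogging) {
            LoggerFactory.createInstance().info("old-style message");
        }
        // ... buffered logging would happen here instead ...
    }

    // Returns true if the verb stayed away from the old-style logger.
    static boolean verbAvoidsOldLogger(boolean usesOldLogging) {
        // Install a mock that fails the moment it is called, mirroring
        // the assertTrue(false) trick described in the text.
        LoggerFactory.instance = message -> {
            throw new AssertionError("old-style logging used during stream create");
        };
        try {
            streamCreateVerb(usesOldLogging);
            return true;
        } catch (AssertionError e) {
            return false;
        } finally {
            LoggerFactory.instance = m -> { };
        }
    }
}
```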
What about Message Formatting?
Another area that can yield dramatic improvements in system performance is the proper use of formatting directives in log messages. It’s hard to believe, but I’ve seen numerous groups use formatting directives whose Javadoc says “never use this directive in a production system.” The worst example of this is the %M directive, which uses Java introspection on each and every log call to determine the name of the method being logged. Tests show that the presence of this directive can have a massive effect on the overall performance of the system.
The following graphs show the time to execute a key system verb on systems running with and without the %M directive. Note that removing this directive caused the overall system throughput to dramatically increase.
Since the proposed change to logging already involves looking at the log messages that are emitted during the critical paths it is here asserted that the extra cost of changing those messages to indicate where they come from is small, especially compared to the benefit gained by using log4j as its designers intended.
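In log4j 1.x terms this is a one-line PatternLayout change. The appender name below is illustrative; the point is simply that the conversion pattern contains no %M (nor %L, %C, or %F, which also force caller-location computation on every call):

```
# Slow: %M makes log4j compute caller location on every log call.
log4j.appender.FILE.layout.ConversionPattern=%d{ISO8601} %-5p %M %m%n

# Fast: identify the call site in the message text instead.
log4j.appender.FILE.layout.ConversionPattern=%d{ISO8601} %-5p %m%n
```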
The upsides of this approach are:
- An enormous reduction in the amount of logging we perform
- Greatly increased lifetime of log files, allowing one to deal with customer issues more than 36 hours old
- Automatic grouping of log messages for a given operation
- A high level of information available for troubleshooting failures (in fact, given that we will be logging so little, we may well be able to add more debug-level information, knowing that it will only be written when needed)
- More informative log messages, since the new logging method signatures require a noun ID
Brian Tarbox is a Distinguished Member of Technical Staff at Motorola in the Systems Engineering group designing next generation video products. He writes a blog on the intersection of software design, cognition, music, and creativity at briantarbox.blogspot.com. His primary contact point is about.me/BrianTarbox .