Brian converts a Java program to Scala and ponders why the Scala version is so much smaller.
I recently rewrote my open source Log4JFugue music project from Java into Scala. As a relative newcomer to Scala, I took this as an opportunity to immerse myself in a modest-sized project. When I was done I was surprised that the code size had shrunk from 2500 to 250 lines. In fact, given that I have about six months of Scala and about 12 years of Java experience, this result was shocking.
During the last month I’ve tried to understand what factors about Scala allow it to do so much in so few lines of code. The following factors seem to apply:
reduced need to code for the lowest common denominator programmer
transformational rather than procedural approach
Of these factors, the transformational approach caught my attention. Object-oriented coding is supposed to be about telling objects to do their thing. However, Java code seems to have become big glops of procedural code where we describe how to do a thing. In short, it feels far more coupled and far less cohesive than we were promised Back In The Day.
Scala on the other hand feels more focused on the transformation of data. In part this is due to the extremely rich library of transformations provided for the various List classes.
As an experiment I decided to implement a log processing method in both Java and Scala and compare the results. The method provides a type of log analysis that we routinely performed in my Video On Demand group. As a measure of system health we monitored the time between successive log messages from particular packages. Short delays indicated code hot spots while very long delays indicated subsystems that might be stuck.
For this experiment we simplified the problem in several ways. We assumed that the log file fits into memory and that all log messages take the form <long time value> <category> <message>, and we ignored exceptions.
The Java code is fairly straightforward, with the only tricky bit being lines 17 and 18, which take 144 characters to define two variables. The code creates a HashMap to hold the previous timestamp keyed by category. Each time a log message is processed, we scan it to find the time and category. We then get the previous time for that category from the HashMap and calculate the difference. We append that result to the category’s vector of time differences in the TreeMap, taking care to create the vector when we reach a category’s second entry, which is the first point at which a difference can be computed. Lines 28 through 36 then print the results as a series of comma-separated lists.
Now here’s the Scala version:
Considerably shorter. But what is that Scala version doing?
The Scala version takes a different approach, performing transforms on lists rather than manipulating mutable HashMaps. Since many readers will be unfamiliar with Scala, and since there are so few lines, we’ll go through the code essentially line by line.
Line three creates a list of the lines in the file; we don’t provide an explicit type for lines because the compiler infers that from the getLines function.
Line four iterates over the list of lines and for each one executes the anonymous code block after the yield. The result of this code block is added to the left-side variable tuppleList. While Java can only return a single object from a method, Scala supports the concept of a Tuple. The anonymous code block splits the log message line and then creates a two-element tuple of the first two entries of the split. So, line four transforms a List of Strings into a List of (timestamp, category) tuples. Not bad for a single line.
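To make that step concrete, here is a minimal sketch of the for/yield transformation run on a small in-memory list. The sample log lines, the `ParseStep` wrapper object, and the `fields` name are assumptions made for illustration; only `tuppleList` and `oneLine` follow the names used in the walkthrough.

```scala
object ParseStep {
  // Hypothetical sample data in the "<time> <category> <message>" format.
  val lines: List[String] = List(
    "1000 db opened connection",
    "1005 net packet received",
    "1030 db query finished"
  )

  // For each line, split on spaces and keep only (timestamp, category).
  // The last expression inside the yield block becomes the list element.
  val tuppleList: List[(Long, String)] =
    for (oneLine <- lines) yield {
      val fields = oneLine.split(" ")
      (fields(0).toLong, fields(1))
    }
}
```

Here `ParseStep.tuppleList` holds three tuples: `(1000, "db")`, `(1005, "net")`, and `(1030, "db")`. The message text is simply dropped because nothing downstream needs it.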
Line five almost looks like a database query in its use of the groupBy method, and the principle is the same. The groupBy method takes a list and creates a map of sublists. Each entry in the map pairs the value being grouped on (here, the category) with the sublist of all entries of the main list that have that category value. In this case we group on oneTuple._2, which means the second item in the tuple.
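A small sketch of the grouping step, assuming the tuples have already been built. The sample data and the names `GroupStep` and `groupedList` are illustrative assumptions, not taken from the original listing.

```scala
object GroupStep {
  // Hypothetical (timestamp, category) tuples.
  val tuppleList: List[(Long, String)] =
    List((1000L, "db"), (1005L, "net"), (1030L, "db"), (1100L, "db"))

  // groupBy keys each sublist by the value the function returns,
  // here the category in the second tuple slot.
  val groupedList: Map[String, List[(Long, String)]] =
    tuppleList.groupBy(oneTuple => oneTuple._2)
}
```

In the Scala standard library, `groupBy` on a `List` returns an immutable `Map` from key to sublist, and the entries inside each sublist keep their original relative order. That ordering is what makes the later pairwise time differences meaningful.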
Line six is called for each of the category sublists. It uses the sliding(2) method to pass a sliding window of size two over the list, in effect looking at adjacent pairs of log messages of the same category. For each such pair it calculates the time difference between the first element of each tuple; i.e., logPairs(0)._1. Each such difference is added to the resulting diffList.
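The sliding-window step can be sketched on a single category's sublist as follows. The data and the names `SlidingStep`, `dbEntries`, and `loneWindow` are assumptions. This sketch also uses `collect` with a pattern match rather than indexing into each window, which is a substitution on my part: it skips the short window that `sliding(2)` produces when a category has only one entry.

```scala
object SlidingStep {
  // Hypothetical entries for one category, in log order.
  val dbEntries: List[(Long, String)] =
    List((1000L, "db"), (1030L, "db"), (1100L, "db"))

  // sliding(2) yields List(e0, e1), List(e1, e2), ...
  // Only full two-element windows match the pattern below.
  val diffList: List[Long] =
    dbEntries.sliding(2).collect {
      case List(first, second) => second._1 - first._1
    }.toList

  // A one-element list still produces a single (short) window.
  val loneWindow: List[List[(Long, String)]] =
    List((1005L, "net")).sliding(2).toList
}
```

With this data `diffList` is `List(30, 70)`. Indexing with `logPairs(1)` would throw on the short window in `loneWindow`, which is tolerable under the article's "ignore exceptions" simplification but worth knowing about.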
Line seven uses the underappreciated mkString method. This method converts a list into a string, with each element separated by the supplied string argument, in this case a comma. This single method call replaces lines 18 through 26 of the Java code and removes the worry of coding up a fencepost error in the list processing.
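For readers who have not met it, a quick sketch of `mkString` on hypothetical data (the `JoinStep` object and its value names are assumptions):

```scala
object JoinStep {
  val diffs: List[Long] = List(30L, 70L, 15L)

  // Separator goes between elements only: no leading or trailing comma.
  val csv: String = diffs.mkString(",")

  // An empty list produces an empty string, with no special casing.
  val empty: String = List.empty[Long].mkString(",")

  // The three-argument form adds a prefix and a suffix as well.
  val bracketed: String = diffs.mkString("[", ",", "]")
}
```

Here `csv` is `"30,70,15"`, `empty` is `""`, and `bracketed` is `"[30,70,15]"`.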
Starting with the import statements, Scala assumes that you might use the Scala library classes and for the most part imports them for you. If you are not familiar with Scala’s collections library, the preceding code will have been challenging. By the same token, if you are not familiar with Java Generics the following line is also challenging:
In the Java code one has to pick through the scaffolding and boilerplate to find the algorithmic code. In the Scala version all of the code contributes to the algorithm. If you prefer Java over Scala, you might say this shows that Scala is too terse. If you prefer Scala over Java, you might say this shows that Java is too verbose.
In the Java version, the main data structures, the HashMaps, are constantly modified. In the Scala version, all of the variables are in fact vals, which means they are immutable. While this program isn’t multi-threaded, one could imagine a version that was. In such a case the Java mutable HashMaps would need synchronization protection while the Scala immutable vals would not.
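As a small illustration of what immutability means here, the sketch below "modifies" a list and a map held in vals. All names and values are hypothetical. Each operation returns a new collection and leaves the original untouched, which is why no other thread could ever observe a half-updated structure.

```scala
object ImmutableStep {
  val original: List[Long] = List(30L, 70L)

  // Appending builds a new list; `original` is unchanged.
  val extended: List[Long] = original :+ 15L

  val grouped: Map[String, List[Long]] = Map("db" -> original)

  // Adding a key builds a new map; `grouped` still has one entry.
  val updated: Map[String, List[Long]] = grouped + ("net" -> List(5L))
}
```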
When I code in Scala I feel like I’m coding in a higher level language. In this example, the problem statement solved by the function is basically “show the differences between successive log messages grouped by category.” The Scala code is summarized by “for the lines in the file build a set of lists grouped by category and compute their differences.” Those summaries sound remarkably similar. The Scala code seems like a description of what to do, while the Java code is a description of how to do it (“add an element to the Vector inside the TreeMap unless that Vector is null in which case insert a new one…”).
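To show how closely code can track that one-sentence summary, here is a minimal end-to-end sketch assembled from the steps described in the walkthrough. It is not the article's original listing: the names `LogDiffs`, `diffsByCategory`, `render`, `computeDiffs`, `groupedList`, and `fields`, the split into three functions, the `category: diffs` output layout, and the sorting of categories are all assumptions made for illustration.

```scala
import scala.io.Source

object LogDiffs {
  // "Build a set of lists grouped by category and compute their differences."
  def diffsByCategory(lines: List[String]): Map[String, List[Long]] = {
    val tuppleList = for (oneLine <- lines) yield {
      val fields = oneLine.split(" ")
      (fields(0).toLong, fields(1))          // (timestamp, category)
    }
    val groupedList = tuppleList.groupBy(_._2)
    groupedList.map { case (category, entries) =>
      // Adjacent pairs within one category; short windows are skipped.
      category -> entries.sliding(2).collect {
        case List(first, second) => second._1 - first._1
      }.toList
    }
  }

  // One line per category (sorted for stable output), diffs comma-separated.
  def render(diffs: Map[String, List[Long]]): String =
    diffs.toList.sortBy(_._1)
      .map { case (category, ds) => category + ": " + ds.mkString(",") }
      .mkString("\n")

  // "For the lines in the file": read everything into memory, then transform.
  def computeDiffs(filename: String): String = {
    val source = Source.fromFile(filename)
    try render(diffsByCategory(source.getLines().toList))
    finally source.close()
  }
}
```

Keeping the file access in `computeDiffs` and the transformation in `diffsByCategory` is a design choice of this sketch: the core logic can then be exercised on a plain `List[String]` without touching the file system.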
Another way of looking at it is this. How many times in your career have you written the code to print a comma-separated list? How many times did you just ignore the extra comma after the last item in the list? How many times did you get the code exactly right the first time? Would you like all that accumulated time back? With the mkString function, no programmer should ever have to reinvent this particular wheel again. In Java before you can hammer a nail you have to go build a hammer. Scala just gives you the hammer and it’s up to you not to hit your thumb with it!
Brian Tarbox is a Distinguished Member of Technical Staff at Motorola in the Systems Engineering group designing next generation video products. He writes a blog on the intersection of software design, cognition, music, and creativity at briantarbox.blogspot.com. His primary contact point is about.me/BrianTarbox .