small medium large xlarge

Scala for the Intrigued

Working with Collections

by Venkat Subramaniam

Generic image illustrating the article
  In this fifth installment of his series on the Scala programming language, Venkat mixes object oriented and functional styles to reveal the power and grace of Scala collections.  

We looked at the object oriented side and the functional side of Scala in the last two articles in this series. These two programming styles interplay powerfully in collections—as we’ll see in this installment of the series.

A direct look at the Scala collections hierarchy can be overwhelming. But when distilled down, each Scala collection can be viewed as belonging to one of two categories: immutable or mutable. Some Scala collections are also lazy (read: efficient).

In both the immutable and mutable flavors, Scala provides implementations of Seq, Set, and Map. Seqs (lists) are ordered collections, sets are unordered, and maps are collections of key-value pairs.

In order to promote programming with no side-effects, Scala imports the immutable collections by default. One of the simplest immutable collections is the Scala list.

 val prices = List(10, 20, 15, 30, 45, 25, 82)

Scala allows us to iterate over the collections using an external iterator (where you control the iteration) or an internal iterator (where you only provide the action to perform for each element).

So, to print each of the elements in the list we created, we can use either

 for(price <- prices) { println(price) }


 prices.foreach { price => println(price) }

(or the concise form: prices foreach println)

Collections provide a wealth of functions to work with. Here’s how we can get the first element from the list:

 println("The first price is " + prices(0))

In Java we’re used to the array syntax instance[index] to access elements and instance.get(index) to access elements from an ArrayList. In Scala we use instance(index). That’s the syntax, but what’s going on behind the scenes? Recall that Scala does not have operators, and performs operations using methods. Scala has two special methods, apply and update, that can go into stealth mode. So, you are really calling instance.apply(param), but you can simply write instance(param). Similarly, instead of writing instance.update(param) = value, you can write instance(param) = value.

So, to access the fifth element, we can write

 println("The fifth element is " + prices.apply(4))

or, in short,

 println("The fifth element is " + prices(4))

We could use this approach and imperative style to select elements. However, internal iterators shine for this operation as well. To select the first price greater than 40, we can write

 println("First price > 40 is " + prices.find { price => price > 40 })

When we run this, the output will be

 First price > 40 is Some(45)

OK, we expected 45, but instead got Some(45). Scala is taking an extra step here to protect us from accidental NullPointerExceptions. If there is no price greater than 40, instead of returning null and expecting us to remember to do the null check, Scala returns an Option type. The option may either be None (non-existent—think of this like DBNull) or a Some value. So, we’re forced to consider the case where there may not be a valid value in response to the function call.

If we’d like all values greater than 40, we can write

 println("All prices > 40 are " + prices.filter { price => price > 40 })

While the find picks the first matching element, filter picks all matching elements.

The collection API is quite comprehensive. If we want to know if any element is greater than 90, we’d write

 println("Any > 90: " + prices.exists { price => price > 90 })

To know if all the elements are less than 90, we’d write

 println("All < 90: " + prices.forall { price => price < 90 })

If the prices above represent prices for some items in another list

 val items = List("A", "B", "C", "D", "E", "F", "G")

We can combine these two quite easily

 println("items and prices: " +

to get

 items and prices:
  List((A,10), (B,20), (C,15), (D,30), (E,45), (F,25), (G,82))

All the operations on a List (those above and a lot more) leave the list unmodified. If we want a new list, we will have to get a copy. For example, if we want to add a price to the beginning of the list, we’d write

 val theNewPricesList = 5 :: prices

The new prices list we created will have the element 5 at the beginning, followed by the elements from the old list. The old list, prices, however, is not affected.

Immutability here means we made a copy of the list, so you will naturally wonder, does that mean poor performance? Not necessarily. In the above example, we appended the new element to the head of the list. Scala List is smart enough to share the entire prices list in the creation of the theNewPricesList. Under the covers, the new collection theNewPricesList has one element (5) followed by a reference to the head of the prices list. So, even though it appeared to be a full copy, the List smartly avoided the overhead and reused the existing list in the creation of the new list.

So, the cost of creating a list with a new element in the head position is constant. However, that’s not the cost if we want to insert the new element in any position other than the head. Such simple sharing of the existing list will not work for other positions. Fortunately, Scala collection provides Vector. This is an implementation of Phil Bagwell’s tries, a smartly designed data structure (known as persistent data structures because their contents are immutable or persist over time) that allows efficient partial copying of elements to provide practically constant-time performance.

Tries are tree structures with large branching factors; each node has 32 or more children. So, a tries that’s only four levels deep can hold a million elements. The nodes are referenced using a special convention: the path to the nodes represents the keys. If the keys are numbers, the trie forms a list. For other keys, the trie would represent a hashmap.

Tries make shallow copies and share most of the subbranches when an element is inserted or removed. So, using Vectors, which are immutable, we can insert (or remove) elements at any position, and a copy is made for us with constant-time performance.

Let’s create a vector of prices.

 val prices2 = Vector(10, 20, 15, 30, 45, 25, 82)

On this Vector, we can perform pretty much all the iteration, find, filter, and zip operations we did with lists.

To create a new vector with an additional element at the head of the vector, we can write

 val theNewPricesList2 = 5 +: prices2

To create a new vector with an additional element at the end of the vector, we can write

 val theNewPricesList3 = prices2 :+ 85

If we only want to create new collections with elements added to the head, a List is adequate. However, for inserts or appends, prefer a Vector. For example, to append an element to a list of million elements on my machine it took 0.25 seconds, while it took only 25 micro-seconds for a vector of the same size.

To store a key-value pair or a dictionary, we can use a Map implementation.

 val langs = Map("Java" -> "Gosling", "Scala" -> "Odersky")
 println("Scala was created by " + langs("Scala"))

The map created this way is immutable. We can create new maps like so:

 println(langs + (("Lisp", "McCarthy")))

This new map will contain the key-values in the old map, plus the new key-value pair. Maps provide several convenience functions just like List and Vector.

We’ve only discussed immutable collections so far. As mentioned, we get this capability by default. To create mutable collections, we have to do an explicit import or reference to the class name.

 val capitals =
  scala.collection.mutable.Map("New York" -> "Albany", "Texas" -> "Austin")
 capitals("Oregon") = "Salem"
 println("After change: " + capitals)

The output from the above code is

 Map(Texas -> Austin, New York -> Albany)
 After change: Map(Oregon -> Salem, Texas -> Austin, New York -> Albany)

Unlike its immutable counterpart, the mutable version of the map provides ways to insert and remove keys from the collection. Likewise, a ListBuffer allows mutation as well.

From the concurrency point of view, immutable collections offer thread safety compared to the mutable versions. We have to select the appropriate implementation based on our need—both in terms of what the collections should do for us and the performance we desire out of them.

Scala also provides lazy collections, which allow us to postpone the creation to just in time or on-demand. This allows us to express some algorithms or logic in very succinct manner. Let’s see how laziness can be useful.

The code below will help us determine if a given number is prime.

 def isPrime(number : Int) = {
  val sqrtOfNumber = math.sqrt(number).toInt
  val hasFactorsOtherThan1AndItself =
  (2 to sqrtOfNumber).exists { i => number % i == 0 }
  number > 1 && !hasFactorsOtherThan1AndItself

Suppose we need to determine a list of prime numbers. We could express it like this:

 //Won't work
 def primes(number : Int) : List[Int] = {
  if(isPrime(number)) number :: primes(number + 1)
  else primes(number + 1)

Given a number, we determine if it is a prime number. If it’s prime, we prepend it to the list of all prime numbers greater than that number. Otherwise, we simply exclude that number and return all greater prime numbers. The code is quite expressive, but won’t work. As we ask the primes to be computed, the code will enter into an uncontrolled recursion and fail with a StackOverflowError.

It would be useful if the list of primes could be expressed as above, but the list itself was not computed until we asked for specific number of elements. Enter the lazy collection called Stream. Let’s change the code to use a Stream.

 def primes(number : Int) : Stream[Int] = {
  if(isPrime(number)) number #:: primes(number + 1) else primes(number + 1)

In this version we replaced the List[Int] with Stream[Int] and the prepend operator :: of List with the operator #:: for Stream.

When we invoke the primes function with a parameter of 1, the code runs until it hits the first call to #:: and pauses, deferring the computation of the collection to a later time. When we invoke the take(10), we are asking the first ten elements of the collection to be returned. At this time, the Stream will execute the deferred code enough times to fill the collection to the requested count. The output from the above code is

 List(2, 3, 5, 7, 11, 13, 17, 19, 23, 29)

The lazy list gives us two benefits. The code is highly expressive, and we’re deferring computations until needed.

Here we used a special collection Stream instead of List. Streams exist to represent lazy collections, but the Scala API also provides easy ways to make straight collections lazy, using view.

Let’s consider a list of names and a function to determine the length of each name.

 val names = List("Steve", "Susan", "Mac", "Brad", "Gill")
 def len(n : String) = {
  println("Len for " + n)
  (n, n.length)

Suppose we want to create a tuple with the name and length, but then extract only the first name of certain length. Here’s the code for that, using the len function and the functions of the list.

 println(names map len find { e => e._2 == 3 })

The function len returns a tuple and the find function is asked to select a tuple where the second value (length) is 3. The output from the above code is shown next.

 Len for Steve
 Len for Susan
 Len for Mac
 Len for Brad
 Len for Gill

The len function was evaluated for each element and then the find function was evaluated for the first three values, until it yielded true.

Using a lazy collection we can eliminate wasted calls to the len function. It hardly takes any effort, as you can see next.

 println(names.view map len find { e => e._2 == 3 })

The output from the above code is:

 Len for Steve
 Len for Susan
 Len for Mac

The call to view turned the names list into a lazy collection. When you called map on the collection, rather than executing the given function, len, it cached away that computation for some later time. When find was invoked on the result, this was a call on a lazy collection that was produced by the call to map. So, rather than executing each function entirely from left to right on all the elements of the collection, the lazy collection executes them left to right entirely on one element, deciding along the way if it has to continue or not. So the computation was short-circuited and no further computation was done once find succeeded, selecting an element.

We’ve merely scratched the surface of the Scala collections API. Familiarize yourself with other classes and functions provided based on what your application may need or where your interests take you.

What’s next? Well, in this and previous article we used functions that accepted other functions. In the next installment we will see how we can create such functions.

Dr. Venkat Subramaniam is an award-winning author, founder of Agile Developer, Inc., and an adjunct faculty at the University of Houston.

He has trained and mentored thousands of software developers in the US, Canada, Europe, and Asia, and is a regularly-invited speaker at several international conferences. Venkat helps his clients effectively apply and succeed with agile practices on their software projects.

Venkat is the author of .NET Gotchas, the coauthor of 2007 Jolt Productivity Award winning Practices of an Agile Developer, the author of Programming Groovy: Dynamic Productivity for the Java Developer and Programming Scala: Tackle Multi-Core Complexity on the Java Virtual Machine. His latest book is Programming Concurrency on the JVM: Mastering Synchronization, STM, and Actors.

Send the author your feedback or discuss the article in the magazine forum.