
Clojure Building Blocks

An Introduction to Clojure and Its Capabilities for Data Manipulation

by Jean-François “Jeff” Héon

  Jeff introduces Clojure fundamentals and uses them to show why you might want to explore this language further.  

I mainly use Java at work in an enterprise setting, but I’ve been using Clojure at work for small tasks like extracting data from log files or generating or transforming Java code. What I do could be done with more traditional tools like Perl, but I like the readability of Clojure combined with its Java interoperability. I particularly like the different ways functions can be used in Clojure to manipulate data.

I will only be skimming the surface of Clojure in this short article, so I will present a simplified view of the concepts. My goal is for the reader to learn enough about Clojure to decide whether it is worth pursuing further with the longer and more complete introductory material already available.

I will start with a mini introduction to Clojure, followed by an overview of sequences and function composition, and finish off with a real-world example.

Ultra Crash Course

Clojure, being a Lisp dialect, expresses program units as lists. A function call is a list whose first element is the function, optionally followed by its arguments.

For setup instructions, look here. Clojure programs can be run as a script from the command line, as a file from your IDE, or precompiled and packaged to be run as a normal Java jar. They can also be simply loaded or typed in the REPL, the interactive development shell. The REPL might be invoked from your IDE or simply called from the command line, provided you have Java 1.5 or higher installed:

 java -cp clojure.jar clojure.main

I invite you to follow along with a REPL on a first or second read and try the examples and variations. You can display the documentation of a function with the doc function.

Entering the following at the REPL:

 (doc +) ;In Clojure, + is a function and not an operator

will echo the documentation. For the article, I precede REPL output with the > symbol.

 ([] [x] [x y] [x y & more])
  Returns the sum of nums. (+) returns 0.

For the curious, you can also display the source of a function with source.

 (source +) ;Try it yourself.

First, let’s start with the mandatory addition example.

 (+ 2 4 6)
 > 12

Values can be associated with a symbol using def.

 (def a 3)

The REPL will write the symbol name preceded by the namespace, #'user/a, in this case.

Typing a at the REPL returns its value.

 a
 > 3

The symbol is bound to the result of the expression after its name.

 (def b (* 2 a))

The str function will concatenate the string representation of its arguments.

 (str "I have " b " dogs")
 >"I have 6 dogs"

We can also string together characters. You’ll notice that character literals in Clojure are preceded by a backslash.

 (str \H \e \l \l \o)

It is common to manipulate data as collections, be it lists, vectors, or whatever. The apply function will call the given function with the given collection unpacked.

 (def numbers [2 4 6]) ;define a vector with 3 integers

(I will omit the echoing of the symbol name for the remainder of the article.)

 (apply + numbers) ;Same as (+ 2 4 6)

Vectors are accessed like a function by passing the zero-based index as an argument.

 (numbers 0)
 > 2

Clojure has many core functions operating on sequences. A sequence allows uniform operations across different kinds of collections, be it a list, a vector, a string, etc. In our examples, we will be using mostly vectors, an array-like data structure with constant-time access.

For example, the take function will return the first n elements.

 (take 3 [1 2 3 4 5]) ;Take 3 from a vector
 >(1 2 3)
 (take 3 "abcdefg") ;Take 3 from a string
 >(\a \b \c)

If you were expecting to get back the string “abc”, you might be disappointed by the result, as I was the first time I tried. What happened here? Operations producing sequences, like take, do not return elements in the original collection data type, but return a sequence of elements. That is why calling take on a string returns a sequence of characters. This means that take on the vector did not return a vector, but a sequence.
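
If you do need the original type back, one option is to rebuild it from the sequence. Here is a minimal sketch using only core functions; the names first-three-chars and first-three-nums are mine, chosen for this illustration.

```clojure
;; take returns a sequence, whatever the input collection was
(def first-three-chars (take 3 "abcdefg"))   ; (\a \b \c)
(def first-three-nums  (take 3 [1 2 3 4 5])) ; (1 2 3)

;; Rebuild a string by applying str to the characters
(apply str first-three-chars)   ;=> "abc"

;; Rebuild a vector with vec, or by pouring the sequence into an empty vector
(vec first-three-nums)          ;=> [1 2 3]
(into [] first-three-nums)      ;=> [1 2 3]
```

The apply str idiom shown here is the same one we will lean on shortly when capitalizing words.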

Let’s define a test vector to explore more sequence manipulations.

 (def days-of-the-week ["sunday", "monday", "tuesday",
  "wednesday", "thursday", "friday", "saturday"])

Oops! I forgot to capitalize the days. Let’s use map, which applies a function to each element of a collection and returns a sequence of the results. For example, the following returns a sequence of our numbers incremented by one.

 (map inc numbers)
 >(3 5 7)

First let’s develop a function to capitalize a word. Note that there already exists a capitalize function in the clojure.string namespace, but we’ll roll our own to demonstrate a few points. We’ll develop our function incrementally using the REPL.

We’ll start by getting the first letter of a word. The function first will create a sequence over the given collection and return the first element.

 (first "word")

Let’s use a bit of Java interop and call the static method toUpperCase of the Java Character class.

 (java.lang.Character/toUpperCase (first "word"))

So far so good. Now let’s get the rest of our word.

 (rest "word")
 >(\o \r \d)

What happens if we want to string our capitalized word together?

 (str (java.lang.Character/toUpperCase (first "word")) (rest "word"))
 > "W(\\o \\r \\d)"

We get back the string representation of the first argument, the letter W, concatenated with the string representation of the sequence of the rest of the word.

We need to use a variant of the function apply, which accepts any number of individual arguments before a final sequence of further arguments.

 (apply str (java.lang.Character/toUpperCase (first "word"))
  (rest "word")) ;Same as (str \W \o \r \d)

Now let’s make a function from our trials and tribulations.

 (defn capitalize [word]
  (apply str (java.lang.Character/toUpperCase
  (first word)) (rest word)))

The first line defines the function named capitalize, taking one parameter named word. The remaining lines are simply our original expression using the parameter.

Let’s try it out.

 (capitalize (first days-of-the-week))
 > "Sunday"

Good. We’re ready to capitalize each day of the week now.

 (def capitalized-days (map capitalize days-of-the-week))
 >("Sunday" "Monday" "Tuesday" "Wednesday"
  "Thursday" "Friday" "Saturday")

Map is an example of a higher-order function, that is, a function that takes one or more functions as arguments. It’s a convenient way of customizing a function’s behavior via another function instead of using flags or more involved methods like passing a class containing the desired behavior inside a method.

Notice that the original collection is left untouched.

 days-of-the-week
 > ["sunday" "monday" "tuesday" "wednesday"
  "thursday" "friday" "saturday"]

Clojure collections are persistent, meaning they are immutable and that they share structure. Let’s add a day to have a longer weekend.

 (conj capitalized-days "Jupiday")
 >("Jupiday" "Sunday" "Monday" "Tuesday"
  "Wednesday" "Thursday" "Friday" "Saturday")

Adding Jupiday has not modified the original collection capitalized-days, which is guaranteed never to change, even by another thread. The longer week was not produced by copying the 7 standard days, but by keeping a reference to the 7 days and another to the extra day. Various collection "modifications", which really return a new data structure, are guaranteed to perform as well, or almost as well, as their mutable counterparts would.
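
To see the immutability for yourself, here is a small sketch with a shortened week (the names week, week-seq, longer-week, and longer-week-seq are made up for this illustration). It also shows a detail worth knowing: conj adds the new element wherever it is cheapest for the collection at hand, which is why Jupiday landed at the front of our sequence above.

```clojure
(def week ["Sunday" "Monday" "Tuesday"]) ; a vector, shortened for brevity
(def week-seq (seq week))                ; a sequence over the same elements

;; conj appends to a vector but prepends to a sequence
(def longer-week (conj week "Jupiday"))
(def longer-week-seq (conj week-seq "Jupiday"))

longer-week      ;=> ["Sunday" "Monday" "Tuesday" "Jupiday"]
longer-week-seq  ;=> ("Jupiday" "Sunday" "Monday" "Tuesday")
week             ;=> ["Sunday" "Monday" "Tuesday"], unchanged
```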

Filtering operations can be done with the higher-order function filter, which returns a sequence of the elements satisfying the passed-in function.

 (filter odd? [0 1 3 6 9])
 >(1 3 9)
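
The function given to filter does not have to be a built-in like odd?. As a small sketch, here is filter with a predicate of our own, built with defn as we did for capitalize. The names long-name? and long-days are hypothetical, and the vector of days is repeated so the snippet runs on its own.

```clojure
(def days-of-the-week ["sunday" "monday" "tuesday" "wednesday"
                       "thursday" "friday" "saturday"])

;; A hypothetical predicate: true when a day name has more than six characters
(defn long-name? [day]
  (> (count day) 6))

(def long-days (filter long-name? days-of-the-week))

long-days ;=> ("tuesday" "wednesday" "thursday" "saturday")
```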

When a function passed to a higher-order function is simple and only used once, there is no need to give it a name. We can define the function in place. We just use fn instead of defn and forgo specifying a name.

For example, here is another way of capitalizing our week days using an anonymous function.

 (map (fn [word] (apply str (java.lang.Character/toUpperCase
  (first word)) (rest word))) days-of-the-week)
 >("Sunday" "Monday" "Tuesday" "Wednesday"
  "Thursday" "Friday" "Saturday")

Another handy sequence operation is reduce. It applies a function to the first two elements of a collection, then applies the function to that result and the third element, and so on.

 (reduce * [1 2 4 8]) ;Same as (* (* (* 1 2) 4) 8)
 > 64

Another form of reduce takes an initial value, which is combined with the first element.

 (reduce * 10 [1 2 4 8]) ;Same as (* (* (* (* 10 1) 2) 4) 8)
 > 640

Let’s sum the number of characters for each day.

 (reduce (fn [accumulator element]
  (+ accumulator (count element))) 0 days-of-the-week)
 > 50

We can redefine the previous anonymous function using syntactic sugar.

 #(+ %1 (count %2))

Note that we can omit the number 1 when referring to the first argument.

 #(+ % (count %2))
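
Put back into context, the shorthand drops straight into the reduce call from before. The sketch below repeats the vector of days so it runs on its own; the names long-form-total and short-form-total are mine.

```clojure
(def days-of-the-week ["sunday" "monday" "tuesday" "wednesday"
                       "thursday" "friday" "saturday"])

;; Long form, with fn and named parameters
(def long-form-total
  (reduce (fn [accumulator element]
            (+ accumulator (count element)))
          0 days-of-the-week))

;; Short form: % is the accumulator, %2 is the current element
(def short-form-total
  (reduce #(+ % (count %2)) 0 days-of-the-week))

long-form-total  ;=> 50
short-form-total ;=> 50
```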

Here is an example that extracts the word for three in three languages from a vector of vectors.

 (map #(% 3) [["Zero" "One" "Two" "Three"]
  ["Cero" "Uno" "Dos" "Tres"]["Zéro" "Un" "Deux" "Trois"]])
 >("Three" "Tres" "Trois")

Composition of Functions

Let’s explore function assembly with a wild example: capitalize and stretch.

Let’s define our additional function.

 (defn stretch [word]
  (apply str (interpose " " word)))

And test.

 (stretch "word")
 >"w o r d"

This would be a standard way of combining stretch and capitalize.

 (map (fn [word] (stretch (capitalize word))) days-of-the-week)
 >("S u n d a y" "M o n d a y" "T u e s d a y" "W e d n e s d a y"
  "T h u r s d a y" "F r i d a y" "S a t u r d a y")

Clojure also provides the comp function, which produces a new function that applies the given functions in succession, from right to left: (comp f g) calls g first and then passes the result to f.

 (map (comp capitalize stretch) days-of-the-week)
 >("S u n d a y" "M o n d a y" "T u e s d a y" "W e d n e s d a y"
  "T h u r s d a y" "F r i d a y" "S a t u r d a y")

Had we wanted to keep a capitalize-n-stretch function, we could have associated the result of the composition with a symbol.

 (def capitalize-n-stretch (comp capitalize stretch))
 (capitalize-n-stretch "Hello")
 >"H e l l o"

We can compose more than two functions, and we can even throw anonymous functions into the mix.

 (map (comp inc (fn [x] (* 2 x)) dec) numbers)
 >(3 7 11)
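
The order of the functions given to comp matters whenever they do not commute. Here is a quick sketch, with names of my own (double-it, double-then-inc, inc-then-double), showing that the rightmost function runs first.

```clojure
(def double-it (fn [x] (* 2 x)))

;; Rightmost function is applied first
(def double-then-inc (comp inc double-it)) ; (inc (double-it x))
(def inc-then-double (comp double-it inc)) ; (double-it (inc x))

(double-then-inc 5) ;=> 11
(inc-then-double 5) ;=> 12
```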

We can produce a new function with partial, by supplying only the first few arguments of an existing function.

 (def times-two (partial * 2))
 (times-two 4) ;Same as (* 2 4)

We can revisit our comp example using partial.

 (map (comp inc (partial * 2) dec) numbers)
 >(3 7 11)
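
A function built with partial still accepts any remaining arguments the original function would take. The following sketch illustrates this; greet is a hypothetical name introduced only for the example.

```clojure
(def times-two (partial * 2))

(times-two 4)   ;=> 8,  same as (* 2 4)
(times-two 3 4) ;=> 24, same as (* 2 3 4)

;; partial works with any function, for example str
(def greet (partial str "Hello, "))

(greet "Jeff")     ;=> "Hello, Jeff"
(greet "Jeff" "!") ;=> "Hello, Jeff!"
```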

A Real-World Example

Here is an example of a real function I wrote to collect all the referenced table names for a specific schema. The SQL statements are peppered throughout various Java files. I call the extract-table-names function for each file, and a corresponding .out file is produced with the referenced table names, uppercased, sorted, and without duplicates. After the file is processed, its name and the table count are returned to be displayed by the REPL. The goal is not for you to understand the whole program, just to get a feel for it.

 (ns article
   ;Import a few helper functions
   (:use [clojure.string :only [split-lines join upper-case]]))
 ;;Extract table names matching MySchema for a given line
 (defn extract [line]
   ;We're using a regular expression
   (let [matches (re-seq #"(\s|\")+((?i)(MySchema)\.\w+)" line)]
     ;Extract the table name (third item in each match)
     (map #(% 2) matches)))
 (defn extract-table-names
   "Extract MySchema.* table names from the java file
   and write them, sorted, to an .out file."
   [file-path file-name]
   (let [;Get the file
         file (slurp (str file-path file-name ".java"))
         ;Split the file by lines
         lines (split-lines file)
         ;Extract and remove non-matches
         names (remove nil? (flatten (map extract lines)))
         ;Uppercased, distinct only and sorted
         cleaned-names (-> (map upper-case names) distinct sort)]
     ;Write the file with unique sorted table names
     (spit (str file-path file-name ".out")
           (join "\n" cleaned-names))
     (str file-name ". Table count: " (count cleaned-names))))
 ;Usage example
 (extract-table-names "/DataMining/" "DataCruncher")
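
If you want to poke at the extraction step without any files on disk, the sketch below runs the same regular expression over a few in-memory lines. The sample lines and the names sample-lines, table-pattern, and first-match are invented for this illustration, and I use mapcat here as a shortcut for mapping and then flattening.

```clojure
(require '[clojure.string :as string])

;; Hypothetical sample input standing in for the lines of a Java file
(def sample-lines
  ["String a = \"select * from MySchema.Orders\";"
   "String b = \"delete from myschema.orders where id = ?\";"
   "int x = 42;"
   "String c = \"select name from MySchema.Customers c\";"])

;; Same regular expression as in the article
(def table-pattern #"(\s|\")+((?i)(MySchema)\.\w+)")

;; re-seq yields one vector per match: [whole-match group-1 group-2 group-3]
(def first-match (first (re-seq table-pattern (first sample-lines))))

;; Index 2 holds the schema-qualified table name
(defn extract [line]
  (map #(% 2) (re-seq table-pattern line)))

;; mapcat maps extract over the lines and concatenates the results
(def names (mapcat extract sample-lines))

;; Uppercased, distinct only and sorted
(def cleaned-names (sort (distinct (map string/upper-case names))))

first-match   ;=> [" MySchema.Orders" " " "MySchema.Orders" "MySchema"]
cleaned-names ;=> ("MYSCHEMA.CUSTOMERS" "MYSCHEMA.ORDERS")
```

Lines without a match simply contribute nothing, so the line declaring x disappears from the result.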

I’ve also used Clojure to extract running time statistics of our system and then generate distribution charts with Incanter, a wonderful interactive statistical platform.

This concludes my brief tour of data manipulation with Clojure. There is a lot more to sequences than what I’ve shown. For example, they are realized as needed, in what is referred to as lazy evaluation. There is an excellent summary of functions in the sequence section of the Clojure cheatsheet. Clojure functions can also be combined in other interesting ways like the thread-first or thread-last macros.
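
As a parting taste of laziness and of the threading macros, here is a brief sketch using only core functions. The names first-five and sum-of-odd-squares are mine.

```clojure
;; (iterate inc 0) is an infinite lazy sequence: 0, 1, 2, ...
;; Only the elements that take asks for are ever computed.
(def first-five (take 5 (iterate inc 0)))

;; Thread-last (->>) passes each result as the last argument of the next form
(def sum-of-odd-squares
  (->> (range 10)
       (filter odd?)
       (map #(* % %))
       (reduce +)))

first-five         ;=> (0 1 2 3 4)
sum-of-odd-squares ;=> 165

;; Thread-first (->) passes each result as the first argument instead
(-> 5 (- 2) (* 10)) ;=> 30, same as (* (- 5 2) 10)
```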

Jean-François “Jeff” Héon has been fascinated with programming ever since his parents got him a Commodore 64 in high school. He loves nagging his co-workers about new languages and frameworks. Jeff is most happy spending time with his wonderful wife and kid.
