A technique developed for automatic theorem proving back in the 1960s proves to be a gateway to many other applications built around matching.
We interrupt this series on dependent types for a brief diversion into an important algorithm in advanced programming languages: unification!
Unification is basically a way of matching pairs of expressions and combining the information in a consistent way, and it finds many uses in AI and in advanced programming languages. Unification was developed in the 1960s by J. A. Robinson as part of the resolution technique in automatic theorem proving. It is a cornerstone of the computation mechanism in Prolog and related logic programming frameworks, such as miniKanren.
It is used in Haskell and related languages as part of the type checker, particularly when polymorphic type variables are in use, where it is used to match type expressions and to propagate consequences. For example, map has type (a -> b) -> [a] -> [b] (which means, given a function converting some type 'a' to another type 'b', and a list of such 'a' values, then the result is a list of such 'b' values). If map is applied to a function of type Int -> String then the type checker expects the second argument to be a list of Ints and infers that the result will be a list of Strings.
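As a quick illustration (the helper `f` here is my own example, not from the article), the instantiation a := Int, b := String can be seen in a couple of lines:

```haskell
-- Given f :: Int -> String, the type checker instantiates map's type
-- with a := Int and b := String, so the second argument must be [Int]
-- and the result is inferred to be [String].
f :: Int -> String
f n = "#" ++ show n

result :: [String]   -- inferred: a list of Strings
result = map f [1, 2, 3]
```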
Unification is also a powerful tool in proof assistants for dependent types, where it is used for matching and filling in missing details—not just type variables, but potentially any part of your program. It’s also a key component of the automatic tactic commands which can fill in many of the obvious cases of proofs.
But I think unification is just a great piece of Computer Science to know and to study! It encompasses several useful theoretical ideas, has clear practical uses, and is a good gateway to even more interesting topics. It’s a good programming exercise too, since it covers quite a few techniques that are needed in (formal) language processing—and it serves as a non-trivial example of Haskell.
Plus, seeing a clear program for some theoretical idea is fantastic reinforcement for the underlying theory! I believe this is a great way to learn: we can write ideas down and experiment freely with them, and it seems to be the way many programmers do think—a better fit than the more traditional “here’s the theory” approach.
A few years back, I had the good fortune to put this approach into practice in a final-year undergraduate module on advanced logic and programming techniques: the first half of the module was a walk-through of implementing a simple theorem prover. A small but committed group of students signed up, plus a guy who decided I was the least bad option from a set of choices. I’m pleased to say “strategic choice guy” really enjoyed it, got good grades on it, and was himself pleased that he now really understood some theory he’d been required to memorize for the exam the previous year (and saw the point of it). Result!
We’ll be writing code to take two expressions and produce some result that indicates whether a match is possible and if so, under what circumstances the match works. The expressions are mainly basic functions & arguments style, with fixed function names and constants mixed up with “variables,” pretty much what you’d see in C. Some concrete examples:
0 and 0 are unifiable—they are the same constant
true and false are not unifiable, because they are different constants
f(0, g(true)) and f(0, g(true)) are unifiable—same function applied to unifiable arguments
the following are each not unifiable with f(0, true): f(1, true), f(0, false), f(0, true, 2), and g(0, true) -- for reasons of: different arguments (twice), different arity (argument count), and different function symbol
The above are yes/no questions. It gets more interesting when we add “variables.” The quotes are deliberate since they aren’t variables inside the (programming) language, but instead variables at the next level up. It’s sometimes useful to call these “flex” variables to distinguish them from the conventional variables in the programming language—which we shall call “fixed” variables for obvious reasons. The point is that we look to refine (fill in) the values of the “flex” variables to see if two expressions can be made to look the same, possibly leaving some of the flex variables unsolved.
In the article, I’ll use a Prolog-style convention of starting the names of all flex variables with an uppercase letter, so X, X1, X_two, XYZ are all flex variables. In contrast, names with initial lowercase letters will be constants and fixed variables.
For example, are f(V1, g(x)) and f(y, g(V3)) unifiable? Yes, we can make them the same by changing the flex variables: in particular, replacing V1 with y and V3 with x, to get the unified expression as f(y, g(x)). We refer to the (flex) variable mapping as a substitution, and it is a mapping of (flex) variable to some sub-expression. When we apply the substitution to either of the input expressions, we will get back the same unified expression.
More formally, unification is a procedure for finding the most general unifier (MGU) for two expressions. The MGU, if it exists, is expressed as a substitution. The “most general” part is important: it means that it does the minimum work to make two expressions unify, imposing the smallest possible number of constraints. For example f(V1, V2) and f(V3, x) could be made identical with a substitution V1 => y, V3 => y, V2 => x, but this forces use of some (fixed) value y which could be a premature choice. The MGU here is V1 => V3, V2 => x, i.e. indicating that V1 and V3 should be the same something, but not constraining yet what that something is.
Finally, the flavor of unification covered here is the vanilla kind, called “first order unification”—which means flex variables can’t appear instead of function names, but only in place of function arguments. There are more powerful versions of unification, including various kinds of “higher order unification” where this restriction is relaxed in various ways, or kinds of unification under quantifiers (or binders), or kinds of unification which are interleaved with the computation mechanism of the language. I’ll leave you to read up on these if you are interested (but do check out Dale Miller’s work on “higher-order pattern unification” for a practical subset).
If you want to see the code from this article in full, in particular used as part of a simple resolution-based theorem prover, take a look at the github repo.
Basic Types: flex variables
Firstly, we need to represent the expressions to be unified. Let’s start with the flex variables. The underlying name will be a simple string, but we want to be careful how they are used, and attach some special functionality to them, so we use a common technique of wrapping the string up in a type of its own, just as we would encapsulate data inside a new class in Ruby.
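The wrapper is a one-liner; a sketch consistent with the names used below:

```haskell
-- Wrap the underlying string in a type of its own, so the type checker
-- keeps flex-variable names distinct from ordinary strings.
newtype VarName = VarName String
  deriving Eq
```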
We want VarName values to be equal when the underlying strings are equal, and Haskell’s deriving mechanism for type classes gives us this standard definition for free. For example, VarName "foo" == VarName "foo" will return True, but VarName "foo" == VarName "bar" will return False. Recall that == is overloaded via a Haskell “type class,” and the type checker selects the appropriate code to use based on the types involved. Next, we want to select how the overloaded show function behaves, but we don’t want the default version from deriving. So instead we explicitly declare this:
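The declaration might look like this (repeating the newtype so the fragment stands alone):

```haskell
newtype VarName = VarName String deriving Eq

-- Show a VarName by unwrapping and returning the underlying string,
-- with no constructor noise in the output.
instance Show VarName where
  show (VarName s) = s
```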
Which basically says, to show a VarName value, grab the wrapped-up string with pattern-matching and just return that string. So, show (VarName "foo") will return "foo".
Note that wrapping the strings up gives us some freedom to change the underlying representation later, i.e., gives us some encapsulation and abstraction. To protect this, we’ll also use the following function to build flex vars instead of the bare constructor (so if the representation changes then only the implementation of this function needs to change, and not all of the places it is used).
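A sketch of that builder function:

```haskell
newtype VarName = VarName String deriving Eq

-- Build flex vars through this function rather than the bare
-- constructor: if the representation changes, only this changes.
mk_var :: String -> VarName
mk_var s = VarName s
```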
Basic Types: Expressions
Next we need a tree-like type to represent expressions. The Term type is defined as follows, where each term has two possibilities: either it is a (flex) variable OR it is a fixed name applied to zero or more terms.
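A sketch of the type (deriving Show on both types here for brevity):

```haskell
newtype VarName = VarName String deriving (Eq, Show)

-- A term is either a flex variable, or a fixed name applied to a list
-- of zero or more argument terms.
data Term
  = Var VarName
  | Fn String [Term]
  deriving (Eq, Show)
```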
Here are some example values in this type:
flex vars by themselves look like Var (mk_var "X1")
fixed vars and constants are represented as “function” symbols applied to zero arguments, so Fn "x" [] or Fn "true" []
a function f applied to a flex var and a constant looks like Fn "f" [Var (mk_var "X1"), Fn "zero" []]
nested function calls like f(g(zero), X1) correspond to nested uses of Fn, e.g., Fn "f" [Fn "g" [Fn "zero" []], Var (mk_var "X1")]
We use Haskell’s default definitions of equality and showing, but the show results aren’t easy to read (but are fine for checking intermediate results and for debugging). Instead, we’ll define a pretty-printer to produce more convenient output. Haskell has several libraries for defining pretty-printers, typically providing a type for documents and many utility functions for building documents in various combinations. Since pretty-printing is going to be useful for several types, let’s have an overloaded function for it by using a type class.
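The class might look like this, using the Doc type from the standard `pretty` package (the Bool instance is my illustrative addition, just to show the shape):

```haskell
import Text.PrettyPrint (Doc, text, render)

-- Overloaded pretty-printing: a type joins the class by saying how to
-- turn its values into Doc values.
class PPrint a where
  pp :: a -> Doc

-- An illustrative instance (not from the article):
instance PPrint Bool where
  pp b = text (show b)
```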
This introduces a new class with a single “interface” function called pp which converts some value in a type into a Doc value. Flex var values need to be pretty-printable, so we define pp for VarName as follows, and it just wraps up the simple string form of the variable inside a document.
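A sketch of that instance:

```haskell
import Text.PrettyPrint (Doc, text, render)

newtype VarName = VarName String deriving Eq

class PPrint a where
  pp :: a -> Doc

-- A flex var pretty-prints as just its underlying name, wrapped up as
-- a document.
instance PPrint VarName where
  pp (VarName s) = text s
```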
Next, we can pretty-print expression trees by defining each of the cases. Var nodes just get their contents pretty-printed—and note the use of overloaded pp on the VarName value. Function nodes are handled in two cases: names with no arguments are shown as is, and names with one or more arguments are shown in the usual way, with commas before each additional argument. Notice that the call to pp in the Var case is going to run a different version of pp because it’s called on a value of a different type. Haskell’s type class system manages all of this for us, selecting the right piece of code to use, without any need for extra annotations. We can just write what we mean, i.e., “just pretty-print this thing” and be confident that the right code will run.
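The Term instance might look like this (again repeating the earlier definitions so the fragment stands alone):

```haskell
import Text.PrettyPrint (Doc, text, hcat, render)

newtype VarName = VarName String deriving Eq
data Term = Var VarName | Fn String [Term] deriving Eq

class PPrint a where
  pp :: a -> Doc

instance PPrint VarName where
  pp (VarName s) = text s

-- Var nodes delegate to VarName's pp; zero-argument names appear as
-- is; otherwise print name(arg1, arg2, ...) with a comma before each
-- additional argument.
instance PPrint Term where
  pp (Var v)   = pp v
  pp (Fn f []) = text f
  pp (Fn f (t:ts))
    = text f <> text "(" <> pp t
             <> hcat [ text ", " <> pp u | u <- ts ]
             <> text ")"
```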
The <> (cf Perl’s 'diamond' symbol) operator is provided by the pretty-printing library to join two document fragments together, and hcat joins a list of such fragments with <>. (Ok, this use of pretty-printing is a bit boring. The library also supports indenting, paragraph filling, line wrapping—all very useful in the full prover, but we don’t really need them here.)
An Important Operation: Collecting Variables
In various places we need to find out which flex variables appear in some value, be it a term or something more complex. It’s a notion we need in several places so a natural candidate for overloading again, hence we define another type class. A type can be in this type class if we provide a definition of vars_in for it.
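A sketch of the class (the VarName base instance is my assumption, added so there is something to try it on):

```haskell
newtype VarName = VarName String deriving Eq

-- A type is in ContainsVars once we say how to collect the flex
-- variables occurring in one of its values.
class ContainsVars t where
  vars_in :: t -> [VarName]

-- A simple base instance (an assumption, not shown in the text):
-- a variable contains just itself.
instance ContainsVars VarName where
  vars_in v = [v]
```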
Next, the definition of vars_in for Term. The leaf case is easy: if the node is a single Var then return the singleton list containing the flex var we find. The branch case requires a bit more thought.
Intuitively, we want the variables in all of the argument subtrees and then join them together. However, we need to be careful here. Have you spotted my bit of sloppiness yet? Have I used an appropriate type for the result value? Not really. Tsk! I should really be using a set-like container type rather than a simple list. Lists allow duplicate elements (and quite rightly so), but for the way we’re going to use collections of variables, duplicates will mean useless extra checking and hence slower code. A set type (something like a Ruby hash) is a better match, since each variable would appear at most once. Of course, the representation here only affects efficiency and not the correctness of the code, but it’s a good reminder for us to always be on our guard and to use the type system to keep us honest...
Anyway, the point is, when we merge the results from two or more calls to vars_in, we need to remove duplicates. Let’s use a powerful feature of Haskell to encode this detail once and for all, and ensure consistent treatment. The code is this:
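A sketch of that instance (the VarName base instance is again my assumption, included so the list rule has something to work on):

```haskell
import Data.List (union)

newtype VarName = VarName String deriving Eq

class ContainsVars t where
  vars_in :: t -> [VarName]

-- base instance, assumed for this sketch:
instance ContainsVars VarName where
  vars_in v = [v]

-- Lists of var-containing things: collect from each element, then
-- foldl a set-style union over the results to remove duplicates.
instance ContainsVars a => ContainsVars [a] where
  vars_in ts = foldl union [] (map vars_in ts)
```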
It literally says, a list of things can be in the ContainsVars type class if the base element already is; and the “constructive proof” (see last month’s article) of this is a definition of vars_in which produces the appropriate result by calling vars_in on each of the elements in the input list and then doing a big (set-style) union on the lists to remove duplicates. (FWIW I’m using foldl for technical reasons concerning how union works.)
This is quite powerful stuff! It allows us to call the overloaded function vars_in on a list of things, and we can be sure that the resulting value has no duplicates. The mechanism works recursively too, so vars_in [[[Var v ]]] (i.e., in a triply-nested list) will return [v] because the code above will be used three times to flatten the structure and flatten it in the right way. Few other languages provide this kind of powerful support.
Back to the code for Term though. With the general instance rule above, our code is nicely simple now. Just this!
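A sketch of the finished Term instance, with the list rule doing the de-duplication:

```haskell
import Data.List (union)

newtype VarName = VarName String deriving Eq
data Term = Var VarName | Fn String [Term] deriving Eq

class ContainsVars t where
  vars_in :: t -> [VarName]

instance ContainsVars a => ContainsVars [a] where
  vars_in ts = foldl union [] (map vars_in ts)

-- A leaf Var yields a singleton; a branch delegates to the list
-- instance, which unions the per-argument results.
instance ContainsVars Term where
  vars_in (Var v)   = [v]
  vars_in (Fn _ ts) = vars_in ts
```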
The traditional presentation of substitutions is something like a hash, storing an expression for each variable to be replaced. We’re going to be a bit more abstract, and use a type class to represent this abstraction. (Implementing an actual instance for this is left as an exercise.) What do we need from a substitution? Basically, a way to see if a given (flex) variable has a replacement term, and ways to build and combine substitutions. Any type being used as a substitution must therefore supply these four basic definitions:
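A sketch of the class, plus a throwaway association-list instance of my own (the article leaves a real implementation as an exercise) so the operations can be tried out:

```haskell
newtype VarName = VarName String deriving Eq
data Term = Var VarName | Fn String [Term] deriving Eq

class ContainsVars t where
  vars_in :: t -> [VarName]

-- The four operations any substitution type must provide; ContainsVars
-- as a superclass means a substitution must also support vars_in.
class ContainsVars s => Substitution s where
  lookup_var        :: s -> VarName -> Maybe Term
  (+->)             :: VarName -> Term -> s
  extend_with       :: s -> s -> s
  emptySubstitution :: s

-- Throwaway association-list instance (my stand-in, not the article's):
newtype Subst = Subst [(VarName, Term)]

instance ContainsVars Subst where
  vars_in (Subst ps) = map fst ps   -- just the domain; enough for a demo

instance Substitution Subst where
  lookup_var (Subst ps) v       = lookup v ps
  v +-> t                       = Subst [(v, t)]
  Subst a `extend_with` Subst b = Subst (a ++ b)
  emptySubstitution             = Subst []
```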
The type class has ContainsVars as a 'superclass', which means that all instances of the class must also be in the ContainsVars class. This is mainly to encode the idea that vars_in is useful in combination with substitutions, and it frees us from repeating a ContainsVars constraint in every signature that involves substitutions; but (in other cases) the presence of a superclass can reflect an expectation that the parent class’s functionality is usually necessary in the child class’s instances, i.e., hard to define without it.
The lookup_var function takes a substitution value and a flex variable, and returns Nothing if the variable is not in the substitution, else returns some Just t if a term is found. Notice that the Maybe concept is a natural fit here: it says exactly what can happen, and avoids messy exceptions. It also forces calling code to handle both cases explicitly. A single substitution mapping is built with an infix operator +->, for example as mk_var "X" +-> Fn "foo" [Fn "y" []]. Bigger substitutions are formed from two smaller ones with extend_with, and emptySubstitution provides a 'zero' value.
Something’s still missing, of course, and that’s more detail about how the various functions interact, and what restrictions should apply. For example, we require properties like lookup_var (s1 `extend_with` v +-> e) v == Just e, or emptySubstitution `extend_with` s2 == s2. Ideally we would prove these inside a system like Idris (see previous articles), but in Haskell that’s not really an option so instead we could try some QuickCheck property tests, or traditional tests, or even write down the proof in the comments. What about cases like v +-> Fn "f" [Var v]—is this valid?
For now—to save time and space—let’s just assume that the code does what we guess it does and that nasty things won’t happen. It’s not ideal, but the key thing is that you realize that some gaps are left and that you have some idea about ways to address the gaps!
Equipped with a notion of substitution, we can start to explain how to apply substitutions to various structures. Again, it’s an operation we want to overload, hence we use another type class. There’s a new detail here, of a member function’s type containing a class constraint: in this case, it means the code for apply may rely on its first argument implementing the above interface for substitutions, but it doesn’t—and need not— rely on a particular concrete implementation.
Defining substitution for Term can be done by cases, again. When substituting on a flex var, we check to see if the var is in the substitution mapping and, if so, we return the replacement term. If not, we return the original term. For the branch case, we just apply the same substitution to all of the child terms. Similar to vars_in, we can deal with the case of apply over a list with a single rule (just map the substitution over the list) and hence shorten the branch case.
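The class and the two instances described in the last two paragraphs might look like this; the Subst association list is again my stand-in for the substitution instance that the article leaves as an exercise:

```haskell
newtype VarName = VarName String deriving Eq

mk_var :: String -> VarName
mk_var = VarName

data Term = Var VarName | Fn String [Term] deriving Eq

class ContainsVars t where
  vars_in :: t -> [VarName]

class ContainsVars s => Substitution s where
  lookup_var        :: s -> VarName -> Maybe Term
  (+->)             :: VarName -> Term -> s
  extend_with       :: s -> s -> s
  emptySubstitution :: s

-- throwaway association-list substitution, so the example can run:
newtype Subst = Subst [(VarName, Term)]
instance ContainsVars Subst where
  vars_in (Subst ps) = map fst ps
instance Substitution Subst where
  lookup_var (Subst ps) v       = lookup v ps
  v +-> t                       = Subst [(v, t)]
  Subst a `extend_with` Subst b = Subst (a ++ b)
  emptySubstitution             = Subst []

-- The class: apply may use any Substitution implementation, but must
-- not depend on a particular concrete one.
class Substitutes t where
  apply :: Substitution s => s -> t -> t

-- Lists: just map the substitution over the elements.
instance Substitutes t => Substitutes [t] where
  apply s = map (apply s)

-- Terms: replace a flex var if the substitution has an entry for it,
-- otherwise leave it alone; push the substitution into the arguments.
instance Substitutes Term where
  apply s (Var v)   = case lookup_var s v of
                        Just t  -> t
                        Nothing -> Var v
  apply s (Fn f ts) = Fn f (apply s ts)
```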
This is getting a bit abstract, so let’s end this section with a brief concrete example.
apply (mk_var "X" +-> Fn "y" []) (Fn "f" [Var (mk_var "X")]) will return Fn "f" [Fn "y" []] after traversing the tree and replacing the flex var X with the constant y.
Make sure you understand how the code above arrives at this result.
And So On to Unification
The groundwork is in place, so it’s time for the main attraction. We’ll be using unification on a pair of terms, but potentially the notion can apply to other structures so let’s overload the name. Cue a type class:
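A sketch of the class, with the scaffolding it needs; the Atom type at the end is a degenerate toy instance of my own, just to see the types in action:

```haskell
newtype VarName = VarName String deriving Eq
data Term = Var VarName | Fn String [Term] deriving Eq

class ContainsVars t where
  vars_in :: t -> [VarName]

class ContainsVars s => Substitution s where
  lookup_var        :: s -> VarName -> Maybe Term
  (+->)             :: VarName -> Term -> s
  extend_with       :: s -> s -> s
  emptySubstitution :: s

-- throwaway association-list substitution (my stand-in):
newtype Subst = Subst [(VarName, Term)]
instance ContainsVars Subst where
  vars_in (Subst ps) = map fst ps
instance Substitution Subst where
  lookup_var (Subst ps) v       = lookup v ps
  v +-> t                       = Subst [(v, t)]
  Subst a `extend_with` Subst b = Subst (a ++ b)
  emptySubstitution             = Subst []

class Substitutes t where
  apply :: Substitution s => s -> t -> t

-- The new class: unifiable things must already support substitution,
-- and unify returns Maybe-some-substitution.
class Substitutes t => Unifiable t where
  unify :: Substitution s => t -> t -> Maybe s

-- A degenerate toy instance (mine, not the article's): atoms unify
-- only with themselves.
newtype Atom = Atom String deriving Eq
instance Substitutes Atom where
  apply _ a = a
instance Unifiable Atom where
  unify a b | a == b    = Just emptySubstitution
            | otherwise = Nothing
```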
Slightly more complex than before, but let’s consider it a piece at a time. It promises a function unify :: t -> t -> Maybe s, with a few additional constraints. Firstly, the result is Maybe s to represent the fact that unification can fail (with Nothing) or succeed (with Just x where x is the resulting MGU). Secondly, there’s a constraint on the type of unify which requires it to return something that is a substitution, which means the result is at least relevant to the task in hand! There’s also a superclass constraint, which means we can only define unification on types which can already have a substitution applied to them. To sum up, this says we can unify a pair of values when we know how to substitute over them, and the result should be usable as a substitution.
It’s probably not a surprise now that I want to define how to unify lists of things before we proceed to unify terms. Informally, two lists [a, b, c] and [a', b', c'] are unifiable if each pair of elements a, a', b, b', c, c' are all unifiable and there are no elements left over (i.e., the lists are the same length). However, the interesting part is that we want to propagate information from the first match into the rest of the matches, so that whatever was learned from the first pair simplifies or constrains the remaining ones. We also want to stop matching as soon as one of the pairs fails, because there’s no point in continuing. Let’s see the code now:
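A sketch of the list instance (with the same scaffolding as before: the Subst and Atom types are my illustrative stand-ins, not the article's code):

```haskell
newtype VarName = VarName String deriving Eq
data Term = Var VarName | Fn String [Term] deriving Eq

class ContainsVars t where
  vars_in :: t -> [VarName]

class ContainsVars s => Substitution s where
  lookup_var        :: s -> VarName -> Maybe Term
  (+->)             :: VarName -> Term -> s
  extend_with       :: s -> s -> s
  emptySubstitution :: s

newtype Subst = Subst [(VarName, Term)]
instance ContainsVars Subst where
  vars_in (Subst ps) = map fst ps
instance Substitution Subst where
  lookup_var (Subst ps) v       = lookup v ps
  v +-> t                       = Subst [(v, t)]
  Subst a `extend_with` Subst b = Subst (a ++ b)
  emptySubstitution             = Subst []

class Substitutes t where
  apply :: Substitution s => s -> t -> t

instance Substitutes t => Substitutes [t] where
  apply s = map (apply s)

class Substitutes t => Unifiable t where
  unify :: Substitution s => t -> t -> Maybe s

-- Empty lists match with the 'zero' substitution; unequal lengths
-- fail; otherwise match the heads, push what we learned (s1) into the
-- tails, match those, and return the combination.
instance Unifiable t => Unifiable [t] where
  unify []     []     = Just emptySubstitution
  unify []     (_:_)  = Nothing
  unify (_:_)  []     = Nothing
  unify (t:ts) (u:us) = do s1 <- unify t u
                           s2 <- unify (apply s1 ts) (apply s1 us)
                           return (s1 `extend_with` s2)

-- toy element type, just to exercise the list instance:
newtype Atom = Atom String deriving Eq
instance Substitutes Atom where
  apply _ a = a
instance Unifiable Atom where
  unify a b | a == b    = Just emptySubstitution
            | otherwise = Nothing
```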
The first line says, we can unify lists of things if we can unify the individual things. The first case says that two empty lists match, and the substitution is the empty or 'zero' one. The next two lines catch the cases where the lists have different lengths, and both return the 'fail' result. (We could also test the lengths of the lists first, but I opted to keep the code simpler.) The final case is the interesting one. You can read the code as: do the first match on the two head elements and get substitution s1, then apply s1 to both of the remaining lists and try to unify the results to get another substitution s2, and finally return the combination of s1 and s2. Make sense?
What about the failing cases, you ask? Well, that’s the joy of the monad pattern. Recall that this pattern allows us to pipeline computations together whilst hiding some of the plumbing. The error handling is the plumbing here, and if that’s hidden we are left with the key details of the algorithm, as if we did not need to handle the errors. It really helps, I hope you agree.
For the record, here’s the monad instance for Maybe. The key lines are marked. Line 2 says, if the first value in a sequence fails, then fail overall. Line 3 says, if the first value succeeded and produced a value, then pass this value to the next stage and continue. This is precisely the boring and repetitive plumbing that we want to leave out of the main code.
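For reference, here is that instance; since the Prelude already defines it, the runnable part below restates the same logic as a standalone function:

```haskell
-- The Monad instance for Maybe, essentially as the Prelude defines it:
--
--   instance Monad Maybe where    -- line 1
--     Nothing  >>= _ = Nothing    -- line 2: a failed step fails overall
--     (Just x) >>= f = f x        -- line 3: pass the value onwards
--
-- The same logic as a standalone function we can actually run:
bindMaybe :: Maybe a -> (a -> Maybe b) -> Maybe b
bindMaybe Nothing  _ = Nothing
bindMaybe (Just x) f = f x
```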
Now let’s consider terms. There are four cases to consider, namely the four combinations of var nodes and branch nodes. The first case is matching flex var to flex var: if the variables are the same, then succeed with no further substitution required, else substitute one for the other. (It doesn’t really matter which one is replaced in this kind of unification, as long as we replace consistently...)
The second and third cases, of a flex var vs a branch node, are mirror images, and we can consider them together. The intuitive action is to substitute the branch term for the var, but we can’t do this unconditionally. The risk is that the flex variable might appear inside the branch term, which would lead to an infinite loop when applying the substitution (because we could end up with a substitution like V1 +-> Fn "h" [Var V1]). So we first perform an “occurs check”: using the vars_in function defined earlier, we test whether the flex var appears among the variables of the branch term. If it does, then we can’t unify and so return Nothing. Otherwise, it’s safe to build the obvious substitution.
The final case compares two branch expressions. The function names must match exactly, hence return Nothing if they differ. Otherwise, we unify the lists of argument values. Happily, the Unifiable [a] instance above makes this easy.
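Putting the four cases together, here is a sketch of the Term instance, assembled with everything built so far. To make it runnable end to end I again include a throwaway association-list substitution (Subst), which the article leaves as an exercise; the repo linked earlier has the real code.

```haskell
import Data.List (union)

newtype VarName = VarName String deriving Eq

mk_var :: String -> VarName
mk_var = VarName

data Term = Var VarName | Fn String [Term] deriving Eq

class ContainsVars t where
  vars_in :: t -> [VarName]

instance ContainsVars a => ContainsVars [a] where
  vars_in ts = foldl union [] (map vars_in ts)

instance ContainsVars Term where
  vars_in (Var v)   = [v]
  vars_in (Fn _ ts) = vars_in ts

class ContainsVars s => Substitution s where
  lookup_var        :: s -> VarName -> Maybe Term
  (+->)             :: VarName -> Term -> s
  extend_with       :: s -> s -> s
  emptySubstitution :: s

-- throwaway association-list substitution (my stand-in):
newtype Subst = Subst [(VarName, Term)]
instance ContainsVars Subst where
  vars_in (Subst ps) = map fst ps
instance Substitution Subst where
  lookup_var (Subst ps) v       = lookup v ps
  v +-> t                       = Subst [(v, t)]
  Subst a `extend_with` Subst b = Subst (a ++ b)
  emptySubstitution             = Subst []

class Substitutes t where
  apply :: Substitution s => s -> t -> t

instance Substitutes t => Substitutes [t] where
  apply s = map (apply s)

instance Substitutes Term where
  apply s (Var v)   = case lookup_var s v of
                        Just t  -> t
                        Nothing -> Var v
  apply s (Fn f ts) = Fn f (apply s ts)

class Substitutes t => Unifiable t where
  unify :: Substitution s => t -> t -> Maybe s

instance Unifiable t => Unifiable [t] where
  unify []     []     = Just emptySubstitution
  unify []     (_:_)  = Nothing
  unify (_:_)  []     = Nothing
  unify (t:ts) (u:us) = do s1 <- unify t u
                           s2 <- unify (apply s1 ts) (apply s1 us)
                           return (s1 `extend_with` s2)

-- the four cases for terms:
instance Unifiable Term where
  unify (Var v) (Var w)                      -- flex var vs flex var
    | v == w             = Just emptySubstitution
    | otherwise          = Just (v +-> Var w)
  unify (Var v) t                            -- flex var vs branch
    | v `elem` vars_in t = Nothing           -- occurs check fails
    | otherwise          = Just (v +-> t)
  unify t (Var v)                            -- branch vs flex var
    | v `elem` vars_in t = Nothing           -- occurs check fails
    | otherwise          = Just (v +-> t)
  unify (Fn f ts) (Fn g us)                  -- branch vs branch
    | f /= g             = Nothing           -- different function symbols
    | otherwise          = unify ts us       -- unify the argument lists
```

Running unify on the earlier example pair f(V1, g(x)) and f(y, g(V3)) produces a substitution that maps both terms to f(y, g(x)), as expected.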
I hope you now understand how unification works, but there are several questions left unanswered, both about the general algorithm and the code I’ve presented. The over-arching one is, does either of them actually work? This question has several aspects:
will the algorithm always terminate (and not drop into an infinite loop)?
are there any missing cases?
is it really returning the most general result, as claimed?
will it always return the correct result?
I’m not going to attempt any answers this time. A good place to start looking for more information is a paper by McBride that uses the richer tools in dependent types to develop a version of unification where some of the key properties just drop out as consequences of the program passing the type checker.
Dr Paul Callaghan first learnt about type theory in 1996, and was immediately fascinated. It filled in many of the unsatisfactory gaps in conventional functional programming, and helped him understand why some programs in Haskell felt wrong and incomplete. Paul was fortunate to do research work with some very talented people from 1997 to 2007, during which time he implemented a niche proof assistant called Plastic and experimented with various ideas like DSLs, Mathematical Vernacular, and Coercive Subtyping. Other bits of his bio can be seen on earlier articles. Paul currently works on a variety of bespoke rails apps at TranscendIt, and thanks TranscendIt for its continued flexibility with his writing schedule! Paul also flies big traction kites and can often be seen being dragged around inelegantly on the beaches of North-east England, much to the amusement of his kids. He blogs at free-variable.org and tweets as @paulcc_two.