Sometimes you think you don’t have the right tool for the job—but it’s really right there in the theory you learned in that math class.
An old Chinese proverb states that learning is a treasure that will follow its owner everywhere. You may need some of that treasure following you when you’re dealing with regexes (Regular Expressions). The outstanding trait of regex standards is that there are plenty of them. Old school regex tools like grep and sed, lack many operators present in regex of Java, Ruby, Perl, C#, Python, etc. But we’re about to see that modern operators are sometimes no more than syntactic sugar.
You’ve got loads of old C files. Unfortunately they are in Kernighan-Ritchie style and your C compiler only accepts ANSI C. You’re concerned about declarations of parameterless functions: in Kernighan-Ritchie C you write foo(), but in ANSI C you write foo(void). You want to find all lines ending in empty parentheses ()—ignoring all other lines—and insert “void” between the parentheses.
However, the problem is harder than that. You also want to ignore lines where the ending pair of parentheses are preceded by the C inline comment token: //. It’s easy to write a regex that solves this problem using negative lookahead. But you’ve got the tool sed built into your operating system (GNU/Linux as well as BSD/OS X) and sed doesn’t support such fancy things as negative lookahead. Is there a sed solution?
There were a lot of buzz words in the last paragraph. Let’s start from the beginning: What is negative lookahead and why is it useful in solving this problem?
You want to find lines ending in empty parentheses, but not if they’re preceded by slash-slash. We can match all lines ending in empty parentheses with the regex \(\)$—i.e. a left parenthesis pre-escaped by backslash, a right parenthesis pre-escaped by backslash and finally a dollar to assert it’s at the end of the line. Parentheses must be escaped, otherwise they are interpreted as group boundaries. Regex assert operators, like the dollar symbol, don’t consume anything from the source string (the string that we’re scanning); they just tell us if a certain condition holds.
Negative lookahead is another assert operator. It checks in advance that a pattern is not present, without consuming anything. It’s written as question mark followed by exclamation mark. The regex (?!.*//) is true if it’s impossible to find a bunch of characters followed by slash-slash. When we prefix with the caret symbol, to assert it’s at the start of the line, then we get the regex ^(?!.*//).*\(\)$—which is correct in, e.g., Ruby:
Ruby promises that the first attempt above is a perfect match—a line that must be redesigned in ANSI style. In the other two cases, Ruby says nil, which in Rubyish means that these lines have slash-slash and should not be touched.
Nice, but not usable since sed—and not Ruby—is our target tool right now. Negative lookahead is not the answer, when ERE is a part of the pre-conditions.
To see what a solution to our problem would be like, look at the first figure. This clock-shaped machine solves the slash-slash problem. The square window at the top of the machine reads a tape—from left to right—consisting of the input string, i.e., the C code. We have a state machine of type Nondeterministic Finite Automaton (NFA). To visualize and represent the automaton, we use a transition graph, in which the vertices represents states and the edges represent transitions. The initial state is identified by an incoming unlabeled arrow not originating at any vertex. The acceptance state is surrounded by a circle. This is the graph that you see printed on the clock dial. Now, whenever a new symbol is read from the tape, the clock dial rotates so that the drooping peak, just below the reading window, points at the current state. Ah, and this particular machine also has a special lookahead feature: it’s a long arm with an eye in the end and a light bulb. This eye can look ahead and tell if there’s any double slash. If there is, the bulb will glow and the machine will understand that it doesn’t matter if the input ends in a pair of parentheses—it’s in a comment anyway.
Regex syntax differs from tool to tool. The Basic Regular Expressions (BRE) standard and its more clever brother the Extended Regular Expressions (ERE) standard make regexes more compatible. The tool sed (as well as grep) implements both. By default, sed uses the simple BRE, but if we add the –r flag in GNU or –E flag in BSD, we get all the extra goodies in ERE. That’s exactly what we’ll do. Do we get an operator that says “don’t touch lines with slash-slash” in that case? I’m sorry to say we don’t. There are these things called character classes. If we write [/], we tell the engine to match a slash. And if we write [^/], we tell the engine to refuse lines with slash. But character classes only focus on one single position in the string, and your problem is to only ignore strings when they have a particular two-character-long substring: slash-slash—the evil twins. We’re just about to give up when we remember that set theory class we went to while studying mathematics at University.
Take a look at the Venn diagram. Two sets A and B have some area in common and each of them also has its unique area. Then there’s area outside both of them. Let’s use the following notation:
A: the area inside A
A': the area outside A—in other words, the complement of A
A u B: the area inside A, inside B, or inside both of them—this is the union of A and B
A n B: the area inside both A and B—we call this the intersection of A and B
(A n B)': the complement of the A and B intersection—everything except the area shared by A and B
Now, let’s say that A represents that there is a slash in position k in the source string (a line in the C code), for a given k. Let’s also say that B represents that there’s a slash in position k+1—and we’re talking about the very same k and the very same source string. We’re interested in (A n B)', which is everything except the intersection of A and B. Does this sound like gibberish? Then hear the translation into English: “The area where the following is not the case: there’s a slash in position k, as well as in position k+1.” I.e., assert it’s not two consecutive slashes.
You may not realize it, but we’re close to the solution now. Please meet Augustus De Morgan, British mathematician and logician during the 19th century. De Morgan renovated algebra and logic. He gave an inventory of the fundamental algebraic symbols (0, 1, +, - , ×, ÷, ()()), and letters—these only, all others are derived) and he gave an inventory of algebraic laws. Wait, that sounds familiar. Isn’t there a De Morgan’s Law? Yes, indeed. As a matter of fact, there are two of them. The second one goes like this:
Yes! The complement to the intersection in the Venn above can be called A' u B', or in plain old English: “there’s no slash in position k or there’s no slash in position k+1.” This seems like a reasonable statement when we want to assert that we don’t have slash-slash. Either there’s no slash here or else there’s no slash in the next position. If our sed regex can repeat that statement from the start of the line, up to the ending empty parenthesis, then we’ve got a candidate for replacement surgery.
Take a look at the third figure. Here’s another clock-shaped abstract machine. It solves the slash-slash problem without the lookahead-eyebulb-arm the other machine required. The square window at the top of the machine reads a tape—from left to right—consisting of the input string, i.e., the C code. As in the previous machine, there’s an NFA graph on the clock dial. And each time a new symbol is read from the tape, the clock dial rotates so that the drooping peak just below the reading window points at the current state. If the machine ends up in the state that is surrounded by a circle, then this is a perfect match. Right now the abstract machine is in the state where one slash has just been read.
In order to close this case, we need to combine the information above in a sed regex, do some testing and then add some final food for thought. First, the regex ([^/]|.[^/])* repeatedly asserts that we don’t have slash-slash. Second, we prefix it with caret, to assert it starts at the beginning of the line, then end it with backslash-escaped parentheses and finally get ^([^/]|.[^/])*\(\)$, when we also suffix it with dollar to assert that the parentheses are at the end of the line. That last regex finds all lines we want to touch—with no negative lookahead used. To touch the line is to keep what we got up to and including the left parenthesis, then add the ANSI C keyword void and finally append a right parenthesis. The back-reference symbol \1 refers to everything inside the first parentheses group in the find string:
Let’s do some back-of-the-envelope calculation with the unix tool sed:
It looks great! Notice that the first and the sixth C lines were changed, while the second, third, fourth, and fifth were left untouched. Mission completed—and we learned some valuable lessons. The rules of Regular Expressions are based on set theory, predicate logic, automata theory, and other standard mathematics. If we have the base theory in our repertoire, we can manage even without fancy operators like negative lookahead.
See, investment in knowledge always pays the best interest. And I also want to add four final words.
Four final words: Thank you De Morgan!
Staffan Nöteberg is the author of the focus booster Pomodoro Technique Illustrated. He loves programming, baklava and Turkish coffee—in that order. And he perseveres in the idea that finite automata are the coolest gadgets ever invented.