Groovy: Don't Fear the RegExp

UPDATE: if you’re using Groovy 1.6.1 or greater (released April 2009), check out the new find and findAll methods in this post.

Some people, when confronted with a problem, think “I know, I’ll use regular expressions!” Now they have two problems. – Jamie Zawinski

There is a common and well-earned aversion in the Java world to regular expressions. Prior to Java 1.4, regular expressions weren’t even part of the standard library. Post 1.4, using regular expressions is still a painful task of working with Pattern and Matcher objects, and lots of typing is involved to make anything happen. It’s difficult enough that most Java devs don’t use them often enough to actually remember how to read a regular expression, and they need to dig up the JavaDocs (or cut and paste an old example) every time they want to use one.

This aversion has persisted into the Groovy community to a level that I haven’t seen in other dynamic scripting languages like Ruby, Python, and (obviously) Perl.

The current regexp docs that pop up in a Google search are all outdated and don’t use any of the best techniques available in the Groovy 1.5.X and 1.6-beta releases. The recent Groovy Recipes book doesn’t have an entry for regular expressions in the index, and I was unable to find a single example of a regular expression in the entire book.

This is unfortunate because Groovy makes using regular expressions much easier than in Java. Under the covers, you’re still working with the same old Java Pattern and Matcher objects, but the Groovy syntax and additions to those classes are pleasant to work with.
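
To make the "same old objects" point concrete, here is a small sketch that builds a Pattern and a Matcher the plain Java way and then again with the Groovy operators covered later in this post. The variable names and the sample text are mine, picked only for illustration:

```groovy
import java.util.regex.Matcher
import java.util.regex.Pattern

// Plain Java API: compile a Pattern, then ask it for a Matcher.
Pattern javaPattern = Pattern.compile("\\d+")
Matcher javaMatcher = javaPattern.matcher("order 66")
assert javaMatcher.find()
assert javaMatcher.group() == "66"

// Groovy syntax that produces instances of the same two classes.
def groovyPattern = ~/\d+/
def groovyMatcher = "order 66" =~ /\d+/
assert groovyPattern instanceof Pattern
assert groovyMatcher instanceof Matcher
assert groovyMatcher.find()
assert groovyMatcher.group() == "66"
```

Everything you already know about the Java classes still applies; Groovy just removes most of the ceremony.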

String Escaping with Slashy Strings

Groovy adds a new type of string escaping, Slashy Strings, that can be used to make your regular expressions easier to read. Forward slashes around text create String objects, just like quotes do. Unlike quoted strings, you don’t have to escape backslashes with another backslash in a Slashy String:

assert java.lang.String == /foo/.class
assert ( /Count is \d/ == "Count is \\d" )

You can also use Groovy expressions in Slashy Strings, just like in double-quoted GStrings:

def name = "Ted Naleid"
assert ( /$name/ == "Ted Naleid" )
assert ( /$name/ == "$name" )

There isn’t anything specific to regular expressions with Slashy Strings, but many regular expressions use shorthand character classes such as \d (digit), \s (whitespace character), \b (word boundary), etc. The JavaDocs for Pattern actually have a nice reference for regular expression character classes if you’re not familiar with them.
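
If those shorthand classes are new to you, a quick sketch like this one can help. It only uses the matches and replaceFirst methods that Java already puts on String, and the sample strings are made up for the example:

```groovy
// \d matches a single digit
assert "7".matches(/\d/)
assert !("x".matches(/\d/))

// \s matches a single whitespace character (space, tab, newline, ...)
assert " ".matches(/\s/)
assert "\t".matches(/\s/)
assert !("x".matches(/\s/))

// \b is a zero-width word boundary: it lets us hit "cat" as a whole word
// without touching the "cat" buried inside "concatenate"
assert "the cat sat".replaceFirst(/\bcat\b/, "dog") == "the dog sat"
assert "concatenate".replaceFirst(/\bcat\b/, "dog") == "concatenate"
```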

Groovy Regular Expression Operators

Groovy adds 3 new operators:

  • “~” - used before a string, it causes the string to be compiled to a Pattern for later use

    // \b means word boundary, [A-Z] means any capital letter, + means one or more
    // so this matches any run of one or more capital letters with a word boundary on either side of it
    def shoutedWord = ~/\b[A-Z]+\b/

  • “=~” - creates a Matcher out of the String on the left hand side and the Pattern on the right

    def matcher = ("EUREKA" =~ shoutedWord)
    assert matcher.matches()         // TRUE
    def numberMatcher = "1234" =~ /\d+/
    assert numberMatcher.matches()   // TRUE

  • “==~” - returns a boolean that specifies if the full String matches the Pattern

    assert "1234" ==~ /\d+/       // TRUE
    assert !("FOO2" ==~ /\d+/)    // the match itself is FALSE: "FOO2" is not all digits
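
The difference between the last two operators trips people up, so here is a short sketch of how I think about it: ==~ demands that the whole string match, while a Matcher from =~ used in a boolean context only asks whether the pattern can be found somewhere. The variable names and sample strings are just for illustration:

```groovy
def digits = ~/\d+/

// ==~ requires the pattern to cover the entire string
assert "1234" ==~ digits
assert !("FOO2" ==~ digits)

// =~ yields a Matcher; in a boolean context Groovy checks whether
// the pattern can be found anywhere in the string
assert "FOO2" =~ digits
assert !("FOO" =~ digits)

// the Matcher still offers the regular Java API for stepping through matches
def m = "FOO2 BAR17" =~ digits
assert m.find() && m.group() == "2"
assert m.find() && m.group() == "17"
assert !m.find()
```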

Enhancements to the String Class

Java’s String class already has “replace*” methods that accept a regular expression (they delegate to the Matcher class), and Groovy enhances String with a few more variants of them. Having these methods directly on String puts regular expressions right at your fingertips.

replaceFirst will replace the first substring matched by a regular expression within the specified String:

assert "Green Eggs and Spam" == "Spam Spam".replaceFirst(/Spam/, "Green Eggs and")

replaceAll will replace all matching substrings within the specified String:

assert "The armor was colored silver" == "The armour was coloured silver".replaceAll(/ou/, "o")
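
Both methods also understand group references such as $1 in the replacement text, which comes straight from the underlying Java regex engine. A minimal sketch with made-up dates and words; note the single quotes, which stop Groovy from trying to interpolate $1 as a GString expression:

```groovy
// Single quotes keep Groovy from treating $3 as GString interpolation;
// the regex engine then reads $1, $2, $3 as the captured groups.
def reordered = "2009-04-15".replaceAll(/(\d{4})-(\d{2})-(\d{2})/, '$3/$2/$1')
assert reordered == "15/04/2009"

// replaceFirst accepts the same group references
assert "foo-bar-baz".replaceFirst(/(\w+)-(\w+)/, '$2-$1') == "bar-foo-baz"
```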

There is an alternate version of replaceAll that takes a closure for the second parameter. This is especially useful in the situations where you want to manipulate the matched value, or groups within the match to dynamically determine the replacement text.

For example, if we wanted to be able to turn a dashed phrase (“foo-bar”) into a camel case word (“fooBar”) we can’t just remove all dash characters, we also need to make the first letter after the dash capitalized (the “B” in “fooBar”).

To do this, we can use a regular expression that captures the first letter after a dash in a group using parentheses.

def dashedToCamelCase(orig) {
    // regular expression is a dash, followed by parentheses that form a group where we hold the word's first character
    orig.replaceAll(/-(\w)/) { fullMatch, firstCharacter -> firstCharacter.toUpperCase() }
}

assert "firstName" == dashedToCamelCase("first-name")

assert "oneTwoThreeFourFiveSixSevenEight" == dashedToCamelCase("one-two-three-four-five-six-seven-eight")

Using the version of replaceAll that takes a closure gives us a chance to manipulate the first character of the word and capitalize it. This closure is always passed the full matched text of the regular expression as the first value, and then any groups as subsequent values.
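
Here is another small sketch of that parameter order, plus the case where the pattern has no groups at all and the closure receives only the matched text. The sample strings and parameter names are mine:

```groovy
// Two groups: the closure receives the full match, then each group in order.
def swapped = "Naleid, Ted".replaceAll(/(\w+), (\w+)/) { full, last, first ->
    "$first $last"
}
assert swapped == "Ted Naleid"

// No groups: the closure receives just the matched text.
def doubled = "3 apples and 5 pears".replaceAll(/\d+/) { number ->
    (number.toInteger() * 2).toString()
}
assert doubled == "6 apples and 10 pears"
```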

Here we modify a phone number and keep the area code group, but replace the exchange and station number with hash marks:

assert "612-###-####" == "612-555-1212".replaceAll(/(\d{3})-(\d{3})-(\d{4})/) { fullMatch, areaCode, exchange, stationNumber ->
    assert fullMatch == "612-555-1212"
    assert areaCode == "612"
    assert exchange == "555"
    assert stationNumber == "1212"
    return "$areaCode-###-####"
}

Enhancements to Collections

Groovy also makes significant additions to what you can do with Collections. In addition to each, collect, inject, etc., there is a regular expression aware iterator called grep that passes each item in the Collection through a filter and returns the subset of items that match the filter. We can use a regular expression as a filter:

// regular expression says 0 or more characters (".*") followed by the string "bar" that is at the end of the string ("$")
assert ["foobar", "bazbar"] == ["foobar", "bazbar", "barquux"].grep(~/.*bar$/)

You can achieve the same thing with findAll but it takes a little more typing:

assert ["foobar", "bazbar"] == ["foobar", "bazbar", "barquux"].findAll { it ==~ /.*bar$/ } 
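
One thing worth knowing is that grep with a Pattern only keeps items that the pattern matches in full, which is why the examples above spell out the leading ".*". A quick sketch of the difference, using the same word list (the variable name is mine):

```groovy
def words = ["foobar", "bazbar", "barquux"]

// grep with a Pattern keeps only items the pattern matches in full,
// so a bare /bar/ selects nothing from this list
assert words.grep(~/bar/) == []
assert words.grep(~/.*bar.*/) == words

// findAll with =~ keeps items where the pattern is found anywhere
assert words.findAll { it =~ /bar/ } == words

// findAll with ==~ behaves like grep: full match only
assert words.findAll { it ==~ /bar.*/ } == ["barquux"]
```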

Working with Matchers

As we’ve seen, using the =~ operator will return a Matcher object. Many of the existing regular expression examples on the web work by treating the Matcher as a list and getting the first (zero-based) element out of the list:

def matcher = "foobazaarquux" =~ "o(b.*r)q"
assert ["obazaarq", "bazaar"] == matcher[0]
assert "bazaar" == matcher[0][1] // get the first grouping of the first match

This is a little fragile as matcher[0] will throw an error if there was not actually a match. Calling matches() doesn’t help as matches only checks if the regular expression matches the WHOLE string:

("foobazaarquux" =~ "o(b.*r)q").matches()  // returns false!
("foobazaarquux" =~ ".*(b.*r).*").matches()  // returns true, ".*" matches 0 or more chars of any type

You can check getCount() to see how many matches there were for some safety:

def m = "foobar" =~ /quux/
if (m.getCount()) {
    // example won't get here as "quux" doesn't exist in "foobar", the count is 0
    println m[0]
}
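
If you find yourself writing that guard a lot, one option is to wrap it up. This is only a sketch: firstGroup is a hypothetical helper of my own, not something Groovy provides, and it leans on the find() and group() methods from the Java Matcher class:

```groovy
// Hypothetical helper (not part of Groovy): returns the requested group of the
// first match, or null when the pattern does not occur in the text.
def firstGroup(String text, pattern, int group = 1) {
    def m = text =~ pattern
    m.find() ? m.group(group) : null
}

assert firstGroup("foobazaarquux", /o(b.*r)q/) == "bazaar"
assert firstGroup("foobar", /(quux)/) == null
```

Returning null plays nicely with Groovy's safe navigation operator, so callers can write firstGroup(text, regex)?.toUpperCase() without an explicit check.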

A groovier way to work with Matchers leverages collection iterators and the built in closures that Groovy provides to them. Matcher supports the iterator() method and with that, gets everything else that any groovy List or Collection would have, including collect, inject, findAll, etc.

def paragraph = """
    Lorem ipsum dolor 12:30 AM sit amet, 
    consectetuer adipiscing 1:15 AM elit. 
    Nunc rutrum diam sagittis nisi 9:22 PM.
"""

def HOUR = /10|11|12|[0-9]/
def MINUTE = /[0-5][0-9]/
def AM_PM = /AM|PM/
def time = /($HOUR):($MINUTE) ($AM_PM)/

assert ["12:30 AM", "1:15 AM", "9:22 PM"] == (paragraph =~ time).collect { it }

assert ["12:30 AM", "1:15 AM"] == (paragraph =~ time).grep(~/.*AM$/)
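
The same idea works for quick number crunching. In this sketch the pattern deliberately has no capturing groups, so each element handed to the closure is a plain String; the inventory text and variable names are made up for the example:

```groovy
def inventory = "12 apples, 7 pears and 30 plums"

// Each element produced by the Matcher is the text of one match
def counts = (inventory =~ /\d+/).collect { it.toInteger() }
assert counts == [12, 7, 30]
assert counts.sum() == 49

// findAll filters the matches just as it would filter a List
def bigCounts = (inventory =~ /\d+/).findAll { it.length() > 1 }
assert bigCounts == ["12", "30"]
```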

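A limitation of the iterator-based methods in the Groovy 1.5.X releases discussed here is that they don’t give you access to the individual groups (hour, minute, am/pm), just the full matched string (“12:30 AM”). (Newer Groovy releases instead hand these methods a list holding the full match followed by each group whenever the pattern contains groups.) The each method is more powerful because as it iterates through, it passes the full match as well as each of the individual groups into the closure.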

("foo1 bar30 foo27 baz9 foo600" =~ /foo(\d+)/).each { match, digit -> println "+$digit" }

// result:
// +1
// +27
// +600
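
Because each hands you the groups as separate closure parameters, it is handy for pulling structured data out of text. A minimal sketch, assuming a made-up "key=number" format and variable names of my own choosing:

```groovy
def settings = [:]
def pairs = "width=640 height=480 depth=24" =~ /(\w+)=(\d+)/

// match is the full "key=value" text; key and value are the two groups
pairs.each { match, key, value ->
    settings[key] = value.toInteger()
}

assert settings == [width: 640, height: 480, depth: 24]
```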

Another example (using the paragraph and time pattern from above) showing how to pretty print all of the timestamps:

(paragraph =~ time).each { match, hour, minute, amPm ->
    println "$hour:$minute ${amPm == 'AM' ? 'this morning' : 'this evening' }"
}

// result: 
// 12:30 this morning
// 1:15 this morning
// 9:22 this evening

Regular expressions are a powerful tool that Groovy makes as accessible as any other top-tier scripting language. Using techniques to break more complicated regular expressions into their component pieces can make them much more readable (as in the time example above).
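
As one more sketch of that component technique, here is a date pattern assembled the same way as the time example. The names YEAR, MONTH, DAY and isoDate are my own, and the pattern only checks the shape of the date (it would happily accept February 31st):

```groovy
def YEAR  = /\d{4}/
def MONTH = /0[1-9]|1[0-2]/
def DAY   = /0[1-9]|[12]\d|3[01]/

// wrapping each piece in parentheses keeps the alternations contained
// and gives us a group per component
def isoDate = /($YEAR)-($MONTH)-($DAY)/

assert "2009-04-15" ==~ isoDate
assert !("2009-13-15" ==~ isoDate)

def m = "released 2009-04-15" =~ isoDate
assert m.find()
assert m.group(1) == "2009" && m.group(2) == "04" && m.group(3) == "15"
```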

If you’re doing any sort of string processing beyond a simple contains or split, regular expressions in Groovy can turn mountains of Java into a couple of lines of code.