Using Groovy Regular Expressions to Parse Code From a Markdown File


The January 2009 issue of GroovyMag was released today. In it, I’ve written an article that shows how easy regular expressions are to use in groovy. It starts with some regular expression basics, shows some common idiomatic groovy usage patterns and wraps up with some of the cool new features that groovy 1.6 is adding to regexp handling.

There are over 30 code samples in the article, and while writing and editing I wanted to make sure that every one of them ran exactly as it appeared in the article text. Also, when you download an issue of GroovyMag, you get a zip file that has a PDF and a set of code “listing” files for each article. Each listing file contains a snippet of groovy code that appeared in the issue.

Directory screenshot showing listing files

I decided to write a couple of simple groovy scripts to keep things DRY, and to ensure that my edits didn’t break anything. The first script extracts code listings out of my draft article and saves them to individually numbered listing files. The second script executes each of the listing files and reports success or failure for each. Sort of a poor man’s JUnit for writing articles.

Both of these scripts leverage regular expressions and the techniques that are outlined in GroovyMag.

I wrote the article using the markdown syntax. If you haven’t seen markdown before, it’s a wiki-like syntax that is very easy to read and write, and can also be converted into a variety of formats easily, including HTML.

In markdown, code blocks are simply lines that are separated from other lines by at least one blank line (or the start of the file) and are indented at least 4 spaces.

Markdown Screenshot Showing Code Indentation
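
As a rough illustration of that indentation rule, here is a minimal sketch that checks lines one at a time. The sample text and the `isCodeLine` closure are made up for this example, and the sketch deliberately ignores the blank-line requirement, which the full regular expression later in the post does handle.

```groovy
// Hypothetical sample text, not taken from the article draft.
def markdown = "Some prose.\n\n" +
               "    def x = 1\n" +
               "\tassert x == 1\n\n" +
               "More prose.\n"

// Simplified rule: a line is treated as code if it starts with four spaces or a tab.
// (The requirement that a block be preceded by a blank line is not checked here.)
def isCodeLine = { String line -> line ==~ /(?: {4}|\t).*/ }

def codeLines = markdown.readLines().findAll(isCodeLine)
assert codeLines == ['    def x = 1', '\tassert x == 1']
```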

This is the script that I used to parse through the markdown file. It uses a regular expression to find all of the code blocks in the markdown and then writes each of them to its own listing file. Listing files were to be named consecutively from listing_1.txt through listing_n.txt; the script zero-pads the number to three digits (listing_001.txt, listing_002.txt, and so on).

#! /usr/bin/env groovy
INPUT_START_OR_BLANK_LINE = /(?:\A|\n\n)/
FOUR_SPACES_OR_TAB = /(?:[ ]{4}|\t)/
CODE = /.*\n+/
CODE_LINES = /(?:$FOUR_SPACES_OR_TAB$CODE)/
LOOKAHEAD_FOR_NON_CODE_LINE = /(?:(?=^[ ]{0,4}\S)|\Z)/

// this regular expression will find all of the consecutive code lines in a markdown file
// in a markdown file, if the line starts with a tab or at least 4 spaces, it's a code line
// slightly modified from one in markdownj
// see: http://github.com/myabc/markdownj/tree/master/src/java/com/petebevin/markdown/MarkdownProcessor.java
MARKDOWN_CODE_BLOCK = "(?m)" + 
                      "$INPUT_START_OR_BLANK_LINE" +
                      "($CODE_LINES+)" +
                      "$LOOKAHEAD_FOR_NON_CODE_LINE"

def removeOldListings(dir) {
    dir.eachFileMatch(~/.*listing_\d+\.txt/) { file ->
        println "Removing $file"
        file.delete()
    }   
}


def createListings(file) {
    def listingNumber = 1
    (file.text =~ MARKDOWN_CODE_BLOCK).each { full, codeBlock ->
        def listing = new File("listing_${(listingNumber++).toString().padLeft(3,'0')}.txt") 
        println "Creating $listing"

        // groovy's String.eachLine skips blank lines, but we want these in our source 
        // to make things more readable so we'll make our own eachLine
        (codeBlock =~ /.*/).each { line ->          
            // each markdown code block comes back with a tab or 4 spaces at the beginning, strip those off
            def matcher = (line =~ /$FOUR_SPACES_OR_TAB(.*)/)
            if (matcher.matches()) {
                matcher.find { fullLine, code ->  listing << "$code" }
            } else {
                listing << "$line\n"
            }
        }
    }
}

removeOldListings(new File("."))
createListings(new File("article.markdown"))
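
To see what the code-block pattern captures without touching any files, the same pieces can be applied to a small in-memory string. This is only a sketch: the sample markdown is invented, the variable names are local to the example, and stripping the indent with `replaceFirst` is a simplification rather than the approach the script takes.

```groovy
// Hypothetical markdown with two code blocks, one indented with spaces and one with a tab.
def markdown = "Intro paragraph.\n\n" +
               "    assert 1 + 1 == 2\n" +
               "    println 'ok'\n\n" +
               "Middle paragraph.\n\n" +
               "\tassert 'a' * 2 == 'aa'\n\n" +
               "Closing paragraph.\n"

// The same building blocks as the script, declared locally so the sketch stands alone.
def startOrBlankLine = /(?:\A|\n\n)/
def fourSpacesOrTab  = /(?:[ ]{4}|\t)/
def codeLines        = /(?:${fourSpacesOrTab}.*\n+)/
def nonCodeLookahead = /(?:(?=^[ ]{0,4}\S)|\Z)/
def codeBlockPattern = "(?m)${startOrBlankLine}(${codeLines}+)${nonCodeLookahead}"

// Group 1 of each match holds the consecutive indented lines of one block.
def blocks = []
(markdown =~ codeBlockPattern).each { full, codeBlock -> blocks << codeBlock }

// Strip one level of indentation from every line and drop surrounding blank lines.
def listings = blocks.collect { block ->
    block.readLines().collect { it.replaceFirst(/^(?: {4}|\t)/, '') }.join('\n').trim()
}

assert listings.size() == 2
assert listings[0] == "assert 1 + 1 == 2\nprintln 'ok'"
assert listings[1] == "assert 'a' * 2 == 'aa'"
```

Because `CODE` ends in `\n+`, each captured block also includes the blank lines that follow it, which is why the sketch trims the result.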

Once I had all of those listing files, I used this script to execute each of them and report whether any problems occurred during execution. It uses the “eachFileMatch” method that groovy adds to the File object; you give it a regular expression pattern and it iterates over only the files whose names match.

#! /usr/bin/env groovy
def executeListings(listingDir) {
    listingDir.eachFileMatch(~/.*listing_\d+\.txt/) { listing ->
        print "Executing $listing..."       
        try {
            new GroovyShell().evaluate(listing)
            println "Success!"
        } catch (java.lang.AssertionError e) {
            println "ERROR!"
            e.printStackTrace()
        }
    }   
}

executeListings(new File("."))
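
One thing to be aware of: the script above catches only AssertionError, so a listing that fails some other way (a misspelled variable, say) would throw out of the loop and stop the run. The following is a hedged variant, not the script used for the article, that catches any Throwable and records a result per file. The temporary directory, file contents, and `results` map are all invented for the sketch.

```groovy
import java.nio.file.Files

// Hypothetical listings written to a temporary directory so the sketch is self-contained.
def dir = Files.createTempDirectory('listings').toFile()
new File(dir, 'listing_001.txt').text = 'assert 1 + 1 == 2'
new File(dir, 'listing_002.txt').text = 'assert 1 + 1 == 3'         // fails an assertion
new File(dir, 'listing_003.txt').text = 'undefinedVariable.size()'  // fails with a different error
new File(dir, 'notes.txt').text = 'not a listing'                   // name does not match the pattern

// Map of file name -> null on success, or the Throwable the listing raised.
def results = [:]
dir.eachFileMatch(~/listing_\d+\.txt/) { listing ->
    try {
        new GroovyShell().evaluate(listing)
        results[listing.name] = null
    } catch (Throwable t) {
        results[listing.name] = t
    }
}
dir.deleteDir()

// eachFileMatch makes no promise about ordering, so sort the names before reporting.
results.keySet().sort().each { name ->
    def error = results[name]
    println "${name}: ${error == null ? 'Success' : 'ERROR ' + error.getClass().simpleName}"
}
```

Only the three files whose names match the pattern are evaluated; notes.txt is skipped because eachFileMatch tests the pattern against the whole file name.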

Just like a good set of unit tests, these scripts gave me the courage to make edits to my article, without needing to worry if I was breaking something or forgetting to make a change. It’s a different kind of meta-programming when I can use regular expressions to help me write about using regular expressions :)
