Wednesday, September 4, 2013

Verbal Expressions

Recently, thechangelog.com wrote a blog post asking programmers to stop writing regular expressions and begin using "verbal expressions".

Instead of writing this regular expression to parse a URL:

^(?:http)(?:s)?(?:\:\/\/)(?:www\.)?(?:[^\ ]*)$

You could be writing this verbal expression (in JavaScript):

var tester = VerEx()
            .startOfLine()
            .then( "http" )
            .maybe( "s" )
            .then( "://" )
            .maybe( "www." )
            .anythingBut( " " )
            .endOfLine();

I'm not sure the second is more readable than the first once you are comfortable with regular expressions, but it could be useful in some circumstances, particularly for programmers who aren't as familiar with the esoteric syntax that matching text frequently requires.

These "verbal" expressions seem to have become popular, with a GitHub organization listing implementations in 19 languages so far.

With that in mind, let's make a Factor implementation!

Basics
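The snippets in this post don't show a vocabulary header, so here is a guess at one that should cover the words they use. The vocabulary name is a placeholder, and I'm assuming `ascii` as the source of `Letter?` and `digit?` (the Unicode-aware versions would be a reasonable alternative). Newer Factor releases may not need `fry` listed at all.

```factor
! Hypothetical header: the vocabulary name is a placeholder, and
! the exact list may vary with the Factor version in use.
USING: accessors ascii combinators.short-circuit fry kernel make
regexp sequences ;
IN: verbal-expressions
```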

We need to create an object that will keep our state as we build our regular expression, holding a prefix and suffix that surround a source string as well as any modifiers that are requested (like case-insensitivity):

TUPLE: verbexp prefix source suffix modifiers ;

: <verbexp> ( -- verbexp )
    "" "" "" "" verbexp boa ; inline

Making a regular expression is as simple as combining the prefix, source, and suffix, and creating a regular expression with the requested modifiers:

: >regexp ( verbexp -- regexp )
    [ [ prefix>> ] [ source>> ] [ suffix>> ] tri 3append ]
    [ modifiers>> ] bi <optioned-regexp> ; inline
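As a rough illustration of what the modifiers string does, here is `<optioned-regexp>` called directly with the option letter `i`. This sketch only assumes the standard `regexp` and `tools.test` vocabularies:

```factor
USING: regexp tools.test ;

! With the "i" option the pattern ignores case.
{ t } [ "HELLO" "hello" "i" <optioned-regexp> matches? ] unit-test

! With an empty options string it behaves like a plain regexp.
{ f } [ "HELLO" "hello" "" <optioned-regexp> matches? ] unit-test
```

Since `<verbexp>` starts with an empty modifiers string, a verbal expression with no modifiers should behave like an ordinary regular expression.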

For convenience, we could have a combinator that creates the verbal expression, calls a quotation with it on the stack, then converts it to a regular expression:

: build-regexp ( quot: ( verbexp -- verbexp ) -- regexp )
    '[ <verbexp> @ >regexp ] call ; inline

When we want to add to our expression, we just append it to the source:

: add ( verbexp str -- verbexp )
    '[ _ append ] change-source ;
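A quick check of how these pieces fit together, assuming the words defined so far in this post are already loaded. Note that `add` does no escaping, so raw pattern text passes straight through:

```factor
! Assumes <verbexp>, >regexp, build-regexp and add from this post
! are already in scope.
USING: regexp tools.test ;

! "a+" is appended to the source as-is and compiled.
{ t } [ "aaa" [ "a+" add ] build-regexp matches? ] unit-test

! matches? requires the whole string to match.
{ f } [ "aab" [ "a+" add ] build-regexp matches? ] unit-test
```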

To match a value literally, we escape every character that is not a letter or a digit with a backslash:

: re-escape ( str -- str' )
    [
        [
            dup { [ Letter? ] [ digit? ] } 1||
            [ CHAR: \ , ] unless ,
        ] each
    ] "" make ;
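To make the escaping concrete, here is what I would expect `re-escape` to return for the pieces of the URL example, assuming the definition above is loaded. The doubled backslashes are just Factor string-literal syntax for a single backslash:

```factor
! Assumes re-escape from this post is already in scope.
USING: tools.test ;

! Letters and digits pass through untouched.
{ "http" } [ "http" re-escape ] unit-test

! Everything else gets a leading backslash.
{ "www\\." } [ "www." re-escape ] unit-test
{ "\\:\\/\\/" } [ "://" re-escape ] unit-test
```

This is deliberately blunt: it also escapes characters that have no special meaning, which is why the generated pattern contains `\:\/\/` just like the hand-written one.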

Methods

We can specify "anything" or "anything but":

: anything ( verbexp -- verbexp )
    "(?:.*)" add ;

: anything-but ( verbexp value -- verbexp )
    re-escape "(?:[^" "]*)" surround add ;

We can specify "something" and "something but":

: something ( verbexp -- verbexp )
    "(?:.+)" add ;

: something-but ( verbexp value -- verbexp )
    re-escape "(?:[^" "]+)" surround add ;

We can specify looking for "start of line" or "end of line":

: start-of-line ( verbexp -- verbexp )
    [ "^" append ] change-prefix ;

: end-of-line ( verbexp -- verbexp )
    [ "$" append ] change-suffix ;

We can specify a value ("then"), or an optional value ("maybe"):

: then ( verbexp value -- verbexp )
    re-escape "(?:" ")" surround add ;

: maybe ( verbexp value -- verbexp )
    re-escape "(?:" ")?" surround add ;

We could specify "any of" a set of characters:

: any-of ( verbexp value -- verbexp )
    re-escape "(?:[" "])" surround add ;
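A few checks for the character-class words, assuming the definitions above are loaded. The difference between the "anything" and "something" families is only `*` versus `+`, which shows up on the empty string:

```factor
! Assumes build-regexp, any-of, anything-but and something-but
! from this post are already in scope.
USING: regexp tools.test ;

! any-of matches exactly one character from the set.
{ t } [ "b" [ "abc" any-of ] build-regexp matches? ] unit-test
{ f } [ "d" [ "abc" any-of ] build-regexp matches? ] unit-test

! anything-but accepts zero characters, something-but needs one.
{ t } [ "" [ "x" anything-but ] build-regexp matches? ] unit-test
{ f } [ "" [ "x" something-but ] build-regexp matches? ] unit-test
```

Because the value goes through `re-escape`, something like `"a-z" any-of` presumably ends up as the three literal characters rather than a range.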

Or maybe simply a line break, a tab, a word, or a whitespace character:

: line-break ( verbexp -- verbexp )
    "(?:(?:\\n)|(?:\\r\\n))" add ;

: tab ( verbexp -- verbexp ) "\\t" add ;

: word ( verbexp -- verbexp ) "\\w+" add ;

: space ( verbexp -- verbexp ) "\\s" add ;
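These shorthand words could be exercised like this, assuming the definitions above are loaded and that the regexp engine understands the `\t`, `\n`, `\r`, `\w` and `\s` escapes they emit:

```factor
! Assumes build-regexp, then, tab, line-break, word and space
! from this post are already in scope.
USING: regexp tools.test ;

! A literal tab between two values.
{ t } [ "a\tb" [ "a" then tab "b" then ] build-regexp matches? ] unit-test

! line-break accepts either \n or \r\n.
{ t } [ "\n" [ line-break ] build-regexp matches? ] unit-test
{ t } [ "\r\n" [ line-break ] build-regexp matches? ] unit-test

! Two words separated by one whitespace character.
{ t } [ "hello world" [ word space word ] build-regexp matches? ] unit-test
```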

Perhaps one or more of whatever was specified last:

: many ( verbexp -- verbexp )
    [
        dup ?last "*+" member? [ "+" append ] unless
    ] change-source ;
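A small sketch of how `many` behaves, assuming the definitions above are loaded. It appends `+` after the most recent piece of the source, unless the source already ends in `*` or `+`:

```factor
! Assumes <verbexp>, build-regexp, then, word and many from this
! post are already in scope.
USING: accessors regexp tools.test ;

! "(?:ab)" becomes "(?:ab)+", so repeats match but empty does not.
{ t } [ "ababab" [ "ab" then many ] build-regexp matches? ] unit-test
{ f } [ "" [ "ab" then many ] build-regexp matches? ] unit-test

! word already ends in "+", so many leaves the source alone.
{ "\\w+" } [ <verbexp> word many source>> ] unit-test
```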

Modifiers

Some helper words allow us to easily add and remove modifiers:

: add-modifier ( verbexp ch -- verbexp )
    '[ _ suffix ] change-modifiers ;

: remove-modifier ( verbexp ch -- verbexp )
    '[ _ swap remove ] change-modifiers ;

We can choose whether matching should be case-insensitive or not:

: case-insensitive ( verbexp -- verbexp )
    CHAR: i add-modifier ;

: case-sensitive ( verbexp -- verbexp )
    CHAR: i remove-modifier ;

And whether we should search across multiple lines or not:

: multiline ( verbexp -- verbexp )
    CHAR: m add-modifier ;

: singleline ( verbexp -- verbexp )
    CHAR: m remove-modifier ;
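Here is roughly how the modifier words could be checked, assuming the definitions above are loaded. The modifiers slot is just a string of option letters, so adding and removing are plain sequence operations:

```factor
! Assumes <verbexp>, build-regexp, then and the modifier words
! from this post are already in scope.
USING: accessors regexp tools.test ;

! The "i" modifier makes the literal match regardless of case.
{ t } [ "HTTP" [ "http" then case-insensitive ] build-regexp matches? ] unit-test
{ f } [ "HTTP" [ "http" then ] build-regexp matches? ] unit-test

! Modifiers accumulate in order, and removing undoes adding.
{ "im" } [ <verbexp> case-insensitive multiline modifiers>> ] unit-test
{ "" } [ <verbexp> case-insensitive case-sensitive modifiers>> ] unit-test
```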

Testing

We can try out our original example using the unit test framework to show that it works:

{ t } [
    "https://www.google.com" [
        start-of-line
        "http" then
        "s" maybe
        "://" then
        "www." maybe
        " " anything-but
        end-of-line
    ] build-regexp matches?
] unit-test
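It's worth checking a few inputs that should not match as well. In this sketch, `url-regexp` is a hypothetical helper that just wraps the expression from the test above, and the definitions from this post are assumed to be loaded:

```factor
! Assumes the words defined in this post are already in scope.
USING: regexp tools.test ;

! Hypothetical helper: the same expression as the test above.
: url-regexp ( -- regexp )
    [
        start-of-line
        "http" then
        "s" maybe
        "://" then
        "www." maybe
        " " anything-but
        end-of-line
    ] build-regexp ;

{ t } [ "http://example.com" url-regexp matches? ] unit-test

! Wrong scheme.
{ f } [ "ftp://example.com" url-regexp matches? ] unit-test

! A space is rejected by the anything-but clause.
{ f } [ "http://exam ple.com" url-regexp matches? ] unit-test
```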

I'm not convinced this is an improvement. In the current specification for "verbal" expressions, the language for expressing characteristics to match against is relatively limited. Perhaps with some effort, this could evolve into a more capable (but still readable) syntax.

In any event, the code for this is on my GitHub.
