20 February 2016

From Regular Expressions to Grammars, Pt. 3

If you haven't been following this series, please check out Part 1 of this series before continuing to read.

Movin' on up

After an admittedly whirlwind tour of the basic set of regular expressions, it's time to enter the big league. We'll take it slow, and just for variety, we'll tackle a bit of JavaScript. I've developed this technique over a few prior posts, so if you've been reading past entries this may seem a bit of a refresher.

At the end of the series, we'll have created a small JIT interpreter for a small bit of JavaScript, to wit, this line:
var a = 3; console.log( "Hey, did you know a = " + a + "?" );
We'll take this one step at a time. I generally like to put this into source code control so I can easily checkpoint and revert source when an approach doesn't work out. Like many people these days, I use git, but feel free to use mercurial, darcs, SVN or whatever works best for you.

The best compiler in the world won't help if it can't read your input, so let's start with a simple test.
say 'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
/'var a = 3; console.log( "Hey, did you know a = " + a + "?" );'/
The 'say' is included so that we can see the output from the match. Right now, we expect that since we're attempting to match a quoted string against an exact copy of itself, the match should trivially succeed. It's roughly the same as
say 'Hello, world!' ~~ /'Hello, world!'/
The entire expression is enclosed within single quotes, so we don't have to concern ourselves with what gets quoted and what doesn't.

Anyway, copy the "say 'a = 3...." statement into a file somewhere and run 'perl6 test.pl6', assuming your file is called 'test.pl6'.  It should print out precisely what it matched, which is the string itself:

jgoff@Demeisen:~$ perl6 test.pl6

「var a = 3; console.log( "Hey, did you know a = " + a + "?" );」
This result lets us know that the match succeeded, and tells us that it matched the entire expression. So, go ahead and checkpoint this in your git (or whatever, I'll assume git from here on out) repository and let's dive in to refactor things just a bit.

Use Cases

If we were at this point to go ahead and write the compiler code, it'd end up being pretty boring. It would be able to compile that exact string to Perl 6 and generate 'my $a = 3; say "Hey, did you know a = " ~ $a ~ "?";', but it couldn't do anything else. See for yourself by changing the "say '...'" part of the expression to test your own variations.

JavaScript commonly gets "minified" before being sent to the browser, with no unnecessary whitespace in the string. So, go ahead and remove the whitespace from the "say '...'" part of your expression, and rerun the test. I'll wait here.

Did it match? I thought not. Do a 'git reset --hard' to get back to the last known-working point, and let's continue. While regular expressions are terribly powerful, they can also be very finicky, as we're about to learn. Let's focus for a few moments on just one of the statements in our compiler, the "a = 3;" statement.

Programmers will come up with lots of creative ways to express this simple statement, from the terse 'a=3;' to tabbing out the '= 3' portion to line up with something else, to the extreme of even putting the semicolon on the next line so that they don't have to add it all the time when copying text around. (Yes, I've seen that style in the wild.)

Remembering the rule that alphanumerics don't need to be escaped, we can combine what we've just learned and match all possible variants like so -
say 'a=3;' ~~ / \s* a \s* '=' \s* 3 \s* ';' \s*/
Which matches any whitespace variant you can dream up, anything from "a=3;" to "\n    a     = 3;" and beyond. There are two lessons to be learned here. The first is that by breaking up our string wherever a programmer might want to insert whitespace, we've taken our first step towards "tokenizing" the expression. We'll go into more depth about that later.

The second is that while it's more flexible, the expression now has these \s* scattered throughout it, making it a little harder to read. Also, inserting "\s*" between every word seems like something a computer should be able to do. And in fact, it can. And there's already a shorthand in Perl 6 for this.
say 'a=3;' ~~ rule { a '=' 3 ';' }
No more \s* interrupting us between every place we might want to put whitespace. Let's get back to our original expression and apply what we've learned here.
say 'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { var a '=' 3 ';'
          console '.' log '(' '"Hey, did you know a = "' '+' a '+' '"?"' ')' ';' }

Structure and Form

You might notice that "Hey, did you know a = " and "?" didn't receive the same quoting treatment as the other parts of the rule. While there are some deep reasons here, the main one is practicality. Eventually this compiler we're writing should be able to accept "!" instead of "?", "Hello world!" instead of "Hey...", and we want to make sure all of these input strings are acceptable.

So, it's time for a bit of refactoring. Let's take a fairly short expression, "Hello, world!" to work with. You might remember from the last section that the \w shorthand matches alphanumerics, so a first cut at refactoring "Hello, world!" might look like this:
say '"Hello, world!"' ~~ rule { '"' \w+ ',' \w+ '!' '"' }
And indeed, that will match. It'll match "Goodbye, world!", "Frankly, dear!" and a host of other expressions. It'd be nice, though, if there were a way to combine the \w, ',' and '!' into one thing that we could then match on, because that would let us match "hello!", "Whats up, doc" and a lot more. Again, yes, there is a shortcut for that.
say '"Hello, world!"' ~~ rule { '"' <[ \s \w , ! ]>+ '"' }
lets you match a group of characters and/or shortcuts, so <[ a b c ]> on its own would match 'a', 'b' or 'c' but not '3' or '32'. We're still not quite there, because there's a lot more to add in to that expression, you'd have to add '?', '+', '.', ';' for starters, like <[ \s \w , ! ? + . ; ]> and so on. And when you bring Unicode into the mix, you'll have to add another million-plus characters for the CJKV region alone.

By now you won't be surprised at all to learn that yes, there's a shortcut for this too. The shortest of all shortcuts, it's simply '.'. So our expression now looks like
say '"Hello, world!"' ~~ rule { '"' .+ '"' }
Which says "Match a quotation mark, then anything, then another quotation mark." And if you spell it out like that, you can probably spot the problem as fast as I did. Simply put, the '.' shortcut matches any character, and yes, this includes the 「"」 (using Japanese quotation marks to make the problem a little clearer.) So what we really want to do is "Match a quotation mark, then anything but the quotation mark, then another quotation mark."

You say "Anything but" in regex-speak like <-[..]>, where the [..] is whatever you don't want to match. We want "anything but a quotation mark", so in to the [] the quotation mark goes, like so:
say 'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { var a '=' 3 ';'
          console '.' log '(' '"' <-[ " ]> '"' '+' a '+' '"' <-[ " ]> '"' ')' ';' }
In real-world code people like to put quotation marks inside quoted strings, but we won't worry about those for the moment.

Nest Eggs

Checkpoint here, and go ahead and play with the strings and whitespace. It can now match 'a=3;console.log("Hey, "+a+" is 3!");' and a bunch of other expressions. Those single-quoted double-quotes do stand out, though, especially since there are two of them. Since we've already got a rule, let's create a new rule just to hold those, and give it a name.
my rule String { '"' <-[ " ]> '"' };

say 'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { var a '=' 3 ';'
          console '.' log '(' '"' .+ '"' '+' a '+' '"' .+ '"' ')' ';' }
Of course, we've still got the original text below, so we'll take a cue from the character groups that we touched on briefly, and put the name in '<'...'>' as well.
my rule String { '"' <-[ " ]> '"' };

say 'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { var a '=' 3 ';'
          console '.' log '(' <String> '+' a '+' <String> ')' ';' }
We've just refactored the 'console.log....' expression into something that humans can readily understand, and in fact expand upon as well. Building on the principle of not repeating yourself, we can notice that the variable is repeated in the string, so let's factor that out as well.
my rule Variable { \w+ };
my rule String { '"' <-[ " ]> '"' };

say 'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { var <Variable> '=' 3 ';'
          console '.' log '(' <String> '+' <Variable> '+' <String> ')' ';' }
We may as well factor out the number as well...
my rule Number { \d+ };
my rule Variable { \w+ };
my rule String { '"' <-[ " ]> '"' };

say 'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { var <Variable> '=' <Number> ';'
          console '.' log '(' <String> '+' <Variable> '+' <String> ')' ';' }

Mere Semantics

You may have noticed that I used '\w+' to match Variables, rather than 'a'. While this allows users of the eventual compiler to write 'var Lennier = 5;' rather than the more prosaic 'var a = 23;', it also allows users to write semantically invalid code. To wit, 'var Prisoner = 6; console.log("I am not " + a + " a number");' where a is not defined.

In principle, we could change the rule so that the second instance of <Variable> must have the same value as the first instance. That way our previous Prisoner example would fail, because 'a' would not be the same as 'Prisoner', and it only allows syntactically and semantically correct code to pass the test.

That's wonderful in principle, and keeps the changes localized to the final rule. Great, and your test suite might even pass, and you release your module out into the real world.

Then someone comes along and writes 'var test = 1; var a = 2; console.log("Hi, test = " + test + ".");'. Which is perfectly valid JavaScript code, and will compile everywhere... except in your compiler. So you say "fine, I'll check for any assignment statements after the function definition. And then someone writes a closure with a 'console.log( .. + variable_outside_the_function + ...)' and breaks your code again.

In summation, let your rules catch syntax errors only, anything else can and should be handled outside. In the fourth and final installment of this series, we'll build a compiler out of this. But let's do one final burst of refactoring in order to clarify the statements.
my rule Number { \d+ };
my rule Variable { \w+ };
my rule String { '"' <-[ " ]>+ '"' };
my rule Assignment-Expression { var <Variable> '=' <Number> };
my rule Function-Call { console '.' log '(' <String> '+' <Variable> '+' <String> ')' };

say 'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { <Assignment-Expression> ';' <Function-Call> ';' }
You've seen the individual pieces before, but probably not put together quite like this. The <Assignment-Expression> rule matches the entirety of 'var foo = 27', and the <Function-Call> part matches 'console.log("Take " + foo + " down, pass them around")' Our final rule puts the pieces together, matching an assignment statement followed by a function call. I've left the ';' separators outside the <Assignment-Expression> and <Function-Call> because they separate expressions, they're not part of them.

Winding Down

The technique applied here isn't specific to JavaScript, of course. Generally, if you can put whitespace around something without changing its semantics, you've found a token like <Number> or <String>. Figure out where those breaks are, and you're already halfway to decoding the language in question. Figuring out the higher-level rules is much more fun, and along the way you'll find yourself making notes "Okay, lists act like this, wait, can I put a <List> inside a list? Yes? Hey..."

Starting from a string, we've split it into chunks, then factored out some of those chunks into their own rules. Once we could abstract out the 32 into a number, foo_bar into a variable and figure out the String type, it's a question of putting them together in combinations, like '<Variable> = <Number>'. If you want, you can even take what we've done above and refactor it down more.

my rule Variable-Declaration { var <Assignment-Expression> }
my rule Assignment-Expression { <Variable> '=' ( <Number> | <String> | <Variable> ) }
Hopefully you can see where this is going, because this handles 'var a = 32; var b = c; var d = "hi"' along with so much more. But this will have to wait for Part 4.


  1. In the last code block there is a closing paren missing in line 2, I think. The opening paren is right before the Number token.

    1. Thanks for the attention to detail. I try to make sure the samples run, but in between copy/pasting I occasionally lose some things.