12 April 2019

Spacing Out

I've had some comments about people having trouble with the top-down grammar approach, and I think I may have isolated at least one problem people may have been having. I'm working on a grammar for a language called 'picat' - you can look up a quick explanation at picat.org. It's a constraint-based programming language that maps insanely well onto Perl 6.

This is a fragment of the grammar I'm working on, done in a top-down fashion. The actual grammar rule <comment> isn't the important thing, because this problem can occur with anything. In my case, if you must know, it's a C-style /* .. */ comment. Of course I ran the test to make sure this little block of code properly matched beforehand, so I could go along making one small change at a time, simple because it's fairly late at night and I've got a flight to catch tomorrow..


'go' '=>'

The natural thing to do here is, of course, say to yourself "Hrm, I've got 3 <comment> comment blocks in a row. We all know there are only 3 important numbers in computer science, 0, 1, and Infinity, so 3 is wrong and should be replaced with <comment>+."


'go' '=>'

So I go to rerun the test, as I'm sticking to my nighttime rule of "one change, one retest", and to my horror it breaks. Luckily for me I've only changed one thing, but ... why is it breaking? Surely <A>+ should at least match <A> <A> <A> ... that's how DFA equivalences work in finite automata.

That's also one point where Perl 6 and traditional DFAs (Deterministic Finite Automata) part ways. I like to think of it as Perl 6 being a little too helpful, having spent a lot of time in the mines of lex and yacc, having to be very explicit about what matches where.

Unfortunately modules like Grammar::Debugger, through no fault of their own, can't quite help here, because while it's a great module to tell you what particular rule or token failed, the problem here is between the terms. <A> {whitespace-optional} <A> is subtly different than <A>+ because <A> <A> lets the parser read whitespace between the two terms; <A>+ assumes the terms come one after the other, whitespace be darned.

So, the simplest solution I have to offer is to let the comment eat the whitespace after it as well, so you can insert your <comment> token anywhere you like and it'll still eat the whitespace no matter how you add it. The <comment> token I have, like I said, is for C/C++ style "balanced" comments - really they're not balanced, /* This is a comment */ but this is not */, and /* This is a comment /* so is this */ this looks like it should but really isn't. */

token comment
  '/*' .+? '*/' \s*

And all is well with the grammar. You can put this rule anywhere you like and it'll behave whether you write <comment> <comment> or <comment>+. This little article was inspired by a Twitter user commenting that even after reading my tutorial series, they got into the actual work of creating a grammar and problems started to happen. My original tutorial series was just that, a tutorial, I didn't get into the specifics of what can actually go wrong because I didn't want to interrupt the flow of the tutorials, but now that the series is pretty much done, I think it'll be beneficial to talk about the actual problems of debugging one of these beasts.

And these thing can most definitely be beasts. Using my ANTLR4 to Perl 6 converter you can generate some incredibly huge grammars, now that it's back working again. But just generating them doesn't necessarily mean they'll compile, although a few do right out of the box, which I'm genuinely amazed at. In fact the full compile test actually does ANTLR -> Perl 6 -> compile -> (parse sample code) on a few grammars, and it actually works.

The problem I run into with these is the tools are sort of limited by the compiled nature of the code. Perl 6 does amazing things with precompiling code, and since grammars and regular expressions are one of the hardest-used things in Perl 6, they get compiled and compiled tight, especially judging by the CSV test suites.

But precompilation comes at a cost. Each matching object is a single Perl 6 function that I can't step into, only over. I'm working on ways around that, and I'm pretty confident there is a solution, but getting under the hood of this particular problem hasn't been easy. In the meantime I'm going to keep working on grammar stuff and when I run into problems, well, it's time to write another article. So look forward to a new series, probably with a prosaic name of "Perl 6 Grammars Debun^wDebugged" or something similar.
Thank you again, dear reader. Comments, clarifications and questions are of course welcome.