Rewriting Perl code for Raku Part V

Last week we started to talk about the pack() and unpack() builtins for Raku and Perl. These aren’t terribly common built-ins to use, so I thought I’d take some time to go over these in detail and talk about how I use them and debug files that use them.

As a gentle reminder, OLE::Storage_Lite is a Perl module to read and write a subset of the Microsoft OLE storage format. As part of my effort at the start, I’ve got a “translation” of the original Perl code pounded out, without much thought to whether it’ll work, or really even compile. It looks like the Perl version, but with most of the {} subscripts changed to <> and each -> changed to a plain dot (.).

What to test first… The reading side seems to be the easiest, because I can check object-by-object to see what the data should look like. Replicating that for Raku becomes essentially fixing the bugs I know I’ve introduced on the way.

Testing testing… is this on?

Before we dive into the Raku code, though, let’s just set up a quick test in Perl. There really wasn’t one to begin with, which is a testament to how well-used the module is. I’ve got a ‘test.xls’ file that I’ve already checked in LibreOffice to make sure it works, so I’ll add a test script that reads the file and checks the root object.

use Test::More;
use OLE::Storage_Lite;

my $root = OLE::Storage_Lite->new( 'sample/test.xls' );
use YAML; die Dump($root);
isa_ok $root, 'OLE::Storage_Lite::PPS::Root';
is $root->No, 0;
is $root->PrevPps, 0xfffffffe;
done_testing;

You might be reading the code and wondering what the heck die() is doing in a test suite. In my current copy it’s actually commented out, but it’s a quick and dirty way to dump the data so I can compare it against the Raku version of the file, which looks almost the same.

use Test;
use OLE::Storage_Lite;

my $root = OLE::Storage_Lite.new( 'sample/test.xls' );
die $root.perl;
isa-ok $root, 'OLE::Storage_Lite::PPS::Root';
is $root.No, 0;
is $root.PrevPps, 0xfffffffe;
done-testing;

Notice there’s hardly any difference overall, just a few minor syntax tweaks. And I don’t need to use YAML. But I’ve got a Q&D way to run my code, and since my screen looks something like this:

I’ve got most of what I need in my face. This is all a rather plain TMUX setup, running multiple panes so I can see what’s going on. On the left is vim running in split-screen mode with the Perl and Raku test files open. The rest are shells in the Perl and Raku directories, and some commands to get byte dumps of the files.

I’ve also in my shells set up the following aliases:

alias 5="perl -Ilib"
alias 5p="prove -Ilib"
alias 6="perl6 -Ilib"
alias 6p="prove -e'perl6 -Ilib'"

This way I can run both Perl and Raku test suites with just a few keystrokes, and not have to worry about details such as -I paths. You’re of course welcome to do things exactly the same, completely different, or even radically better than I am, in which case please let me know.

You might notice the use of the language’s old name here. I haven’t changed over to the new binaries yet, but the techniques I’ll talk about here won’t change.

Keeping it Clean

We now have two scripts that should produce the same output, but probably won’t, for any number of reasons. I’ve got a whole article’s worth of things that I had to do to make the new module compile, let alone run. But that’s for a later issue.

Let’s start out with this section, which might be familiar to longtime (ha!) readers.

  $rhInfo->{_FILEH_}->seek(0, 0);
  $rhInfo->{_FILEH_}->read($sWk, 8);
  return undef unless($sWk eq "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1");

This is in Perl, of course. In Raku I’ve chosen to write

  $file.seek( 0, SeekFromBeginning );
  my Str $sWk = $file.read( 8 ).unpack('A8');
  die "Header ID incorrect" if $sWk ne HEADER-ID;

It’s a bit ungraceful to die() inside a module, but this guarantees that execution stops way before it can cause a hard-to-debug problem down the road. The first change is that I’ve refactored $rhInfo->{_FILEH_} out into its own $file variable so I don’t have to repeat references to $rhInfo all over the place, like the original.

Next is using the built-in IO::Handle constant ‘SeekFromBeginning’ instead of the rather anodyne 0 as in Perl. Probably the parent OLE::Storage module looked ahead in the file to determine something before reading in earnest. I’m keeping it here for no good reason other than it might be nice to separate ‘read’ functionality into a different method.

Diving in

The next line will cause some consternation, so I’ll unpack it slowly. The original author used Hungarian notation for their variable names, so the ‘s’ of $sWk means that it’s a string type. I’ve adopted this for the Raku code as well, actually enforcing the variable type without additional code.

File handles have both a fancy lines() method that lets you read files line-by-line, and a raw read() method that lets you read raw bytes. If I stopped right here and just looked at the raw bytes, the code would actually fail, and I’ve talked about why in earlier parts. Suffice to say that read() returns a buffer of uninterpreted bytes, not a string, so you have to decode it yourself later.
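To see that for yourself, here’s a small self-contained sketch. It writes ten made-up bytes to a scratch file (the file name is my own invention, not part of the module), reads eight of them back, and checks what kind of thing read() handed over:

```raku
# Write a tiny stand-in file so the example is self-contained.
# The file name is hypothetical, chosen only for this sketch.
my $path = $*TMPDIR.add( 'ole-read-demo.bin' );
$path.spurt( Blob.new( 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1, 0x00, 0x00 ) );

my $file = $path.open( :bin );
$file.seek( 0, SeekFromBeginning );
my $raw = $file.read( 8 );
$file.close;
$path.unlink;

say $raw.^name;    # the name of a Buf type, not Str
say $raw.elems;    # how many bytes we got back
say $raw ~~ Str;   # False: nothing has been decoded yet
```

Nothing here touches unpack(), so it should behave the same whether or not the experimental bits are switched on.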

Decoding here is the job of the unpack() statement. It acts just like its Perl counterpart, but is experimental. Lucky for me, it implements enough of the Perl builtin that I can use it to read the entire OLE file.

Now, unlike other builtins (again, keep in mind it’s experimental), it’s only available as a method call. There is a version of unpack() that works on multiple arguments, but if you try to call it as a builtin, expect:

===SORRY!=== Error while compiling -e
Undeclared routine:
    unpack used at line 1. Did you mean 'pack'?

This may be fixed in your version, feel free to try it and let me know if I should upgrade 🙂 In any case, the last bit you’re wondering about is the ‘A8’ business as its argument. I think this isn’t explained correctly in the documentation, so I’ll explain in my own way.

read() returns a raw buffer of bytes, without interpretation. If it sees hex 0x41, it doesn’t “know” if you meant the ASCII character ‘A’ or the number 65, so it doesn’t interpret the data, it just puts the data into the buffer. It relies on the Buf(fer)’s pack() and unpack() methods to assign types to the data.

So finally, unpack( "A8" ) pulls out 8 “ASCII” characters and puts them into $sWk. Now I used scare-quotes there because ASCII is a 7-bit encoding, not 8 bits as many people seem to think. It only encodes from 0x00-0x7f, so anything over that isn’t legal ASCII.

Which just means that the “A” of “A8” doesn’t truly correspond to ASCII, but it’s close enough. So, we call unpack( "A8" ) on the buffer that $file.read( 8 ) returns, and get back a string that we can finally check against our header.
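If you’d like to poke at the “one byte, one character” idea without the experimental unpack(), here’s a rough sketch. My understanding is that ‘A’ treats each byte as a code point, so I’m imitating that with chr() and with a Latin-1 decode; that equivalence is my assumption, not something I’ve lifted from the spec.

```raku
# The eight signature bytes, as read() would hand them back.
my $bytes = Blob.new( 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 );

# One byte becomes one character; values above 0x7f are not rejected.
my $via-chr    = $bytes.list.map( *.chr ).join;
my $via-decode = $bytes.decode( 'iso-8859-1' );

say $via-chr eq $via-decode;
say $via-chr eq "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1";
say $via-chr.chars;
```

Either string can then be compared against a header constant with plain old eq or ne.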

Debugging

But what if the header isn’t what we expect? Your first instinct might be to say you must’ve screwed up and sent it the wrong file. Luckily it’s pretty easy to check that: just call $file.slurp.print and that’ll tell you the contents quickly. If it’s text you’ve probably got the wrong file – OLE files do contain text but it’s usually zip’ed or in UCS-2.

Let’s assume though that it’s an actual binary file, and a real spreadsheet that Excel (or LibreOffice in my case) can read. Since the headers don’t match, it must be a different version of OLE that our code isn’t ready to handle.

That means we need to know what the first 8 bytes of the file actually are. We’ve got a bunch of tools at our disposal, but what I want to introduce is hexdump(1) (don’t worry about the (1), force of habit.) Run this command on the file:

hexdump -C sample-file.xls | head -1

This should generate something like this:

00000000  d0 c9 11 a0 af b1 13 d1  00 00 00 00 00 00 00 00  |................|

(Original bytes changed to protect the innocent file.) The numbers on the left (‘00000000’) tell us how far we are into the file (in hex), the next two groups of 8 are the hex values of the individual bytes of the file, and the dots between ‘|..|’ are where any printable characters would appear, if there were any.
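If hexdump isn’t handy, something similar can be cobbled together in Raku. This is only a sketch: hex-prefix is a helper name I made up, and the sample bytes are the same doctored ones as in the dump.

```raku
# Hypothetical helper: show the first $count bytes of a Blob as hex pairs.
sub hex-prefix( Blob $bytes, Int $count = 8 --> Str ) {
    $bytes.subbuf( 0, $count ).list.fmt( '%02x', ' ' );
}

# Stand-in for what $file.read( 16 ) might return.
my $sample = Blob.new( 0xd0, 0xc9, 0x11, 0xa0, 0xaf, 0xb1, 0x13, 0xd1, 0x00, 0x00 );
say hex-prefix( $sample );
```

In the real code you’d hand it the buffer from $file.read() instead of a hand-built Blob.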

So now we know what the first 8 bytes of this file look like, and we can add (without much muss or fuss) some checks to our original file, and come up with this:

$file.seek( 0, SeekFromBeginning );
my Str $sWk = $file.read( 8 ).unpack('A8');
die "Unknown OLE header!" if $sWk eq "\xd0\xc9\x11\xa0\xaf\xb1\x13\xd1";
die "Header ID incorrect" if $sWk ne HEADER-ID;

This check isn’t in my source, so don’t go looking for it. As far as I know there aren’t any other OLE header strings than what I check for, but then I’m trying to get away without reading the spec. My blood pressure doesn’t need that.

Getting at the details

Of course, binary packed formats contain more stuff than just ASCII strings. OLE was originally written in the days of 16-bit CPUs, so it’s got other ways to pack in data. Let’s look at a fragment of the file format: (not from the spec, this is just my interpretation)

0000: 0xD0 0xCF 0x11 0xE0 0xA1 0xB1 0x1A 0xE1 # header
0008: 0x00 0x09           # size of large block of data (in power-of-2)
000a: 0x00 0x06           # size of small block of data
000c: 0x00 0x00 0x00 0x03 # Number of BDB blocks
0010: 0xff 0xff 0xff 0xfe # Starting block

So, these are the first 20 (0x0010+4) bytes of an OLE header block. You may have already caught on to the fact that there are at least 3 sizes of data here. The first 8 bytes on line 0000 are the header data we talked about ad nauseam.

Next, the header says that a “large” block of data is 2**9 bytes long, and a “small” block of data is 2**6 bytes long, each stored in a pair of bytes. Finally we’ve got the number of BDB blocks (whatever those are; the name most likely stands for Big Block Depot rather than anything to do with Berkeley DB) and the starting block’s index number, all in 4-byte chunks.

This means we need to read two 2-byte chunks and two 4-byte chunks into memory. This time though, we have to read them as numbers. Once again, unpack() comes to the rescue. Last time we used the ‘A’ character, this time we’ll do something just a little bit different.

Let’s read the documentation for unpack() to see what we can use. Halfway down the page we come to a table which gives us the letter abbreviations for each type of data we can read, and what it is in terms of where it is in memory.

For now, replace the term ‘element’ with ‘byte’ while you’re reading the documentation. We need to read (0x00, 0x09) as a 2-byte integer, so let’s look for “two elements” on the right-hand side. “Extracts two elements and returns them as a single unsigned integer” seems to be what we need.

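So it looks like the letter we need to use is “S”, and since we only want to read one at a time, that’s all we need. But the original Perl source uses “v”, so that’s what I’ll use as well. The difference matters: “S” reads the two bytes in whatever order the machine running the code prefers, while “v” always reads them least-significant byte first (little-endian). Keep in mind that the fragment above lists each number most-significant byte first, so a size of 9 read with “v” would sit in the file as 0x09 0x00.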

  $iWk = _getInfoFromFile($rhInfo->{_FILEH_}, 0x1E, 2, "v");
  return undef unless(defined($iWk));
  $rhInfo->{_BIG_BLOCK_SIZE} = 2 ** $iWk;

But as you can see, the Perl source creates a wrapper around the unpack() call, much to my annoyance. I’d prefer to simply write this:

$iWk = $file.read( 2 ).unpack( "v" );
%hInfo<_BIG_BLOCK_SIZE> = 2**$iWk;

but to keep things looking as similar to the original Perl code as I can, my code looks like

  my Int $iWk = self._getInfoFromFile( $file, 0x1E, 2, "v" );
  die "Big block size missing" unless defined( $iWk );
  %hInfo<_BIG_BLOCK_SIZE> = 2 ** $iWk;

which is just one line longer, and that’s because of the safety check. Of course, pack() and unpack() can take more than one format character at a time. In Perl, there’s yet another mini-language (like regex, and what used to be called the format statement) for these builtins, and Raku’s version of it isn’t quite done yet.

But you can still take the entire header we’ve collected so far, and write it into a single unpack() statement like so:

my ( $header, $large-size, $small-size, $num-bdbs, $start-block ) =
  $file.read( 20 ).unpack( "A8 vv VV" );

This format is of course much more compact and much easier to read. In all probability once I get done with the main module I’ll convert everything over to this style and the code will become much, much quieter. Binary protocols, especially those for moisture evaporators, tend to have lots of code that looks like:

my ( $r2, $d2 );
my $rev = $file.read(2).unpack("v");
if $rev == 0x01 {
  $r2 = $file.read(2).unpack("v");
} else {
  $d2 = $file.read(4).unpack("V");
}

where the next bytes you read depend upon the version of the protocol. Even though I’ve just been rattling off code based on the Perl version, I don’t know what the protocol may do at any given point. So it makes sense to read just one int or long ahead while developing.

I could read a version number as “V” because they started out using “v1”, “v2” and so on up to “v42792643522”. But then 30 lines and 2 revs later they may have changed from “V” to “vcc” because they wanted to support “v2.1.0” style.

And if that header were something like “A8 V CC* V vv” I have to go back and break up the format string and statement at the very least. If I go term-by-term I just have to find the version number and add an if-then statement just below.
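Going term-by-term also doesn’t have to mean leaning on the experimental unpack(). Reasonably recent Rakudo releases give Blobs read-uint16() and read-uint32() methods that take an explicit byte order. Here’s a minimal sketch, assuming the four header fields sit back to back in little-endian order; the bytes are made up to match the fragment earlier, not read from a real file.

```raku
# Two 16-bit values, then two 32-bit values, least-significant byte first.
my $raw = Blob.new(
    0x09, 0x00,                 # large block size, as a power of 2
    0x06, 0x00,                 # small block size, as a power of 2
    0x03, 0x00, 0x00, 0x00,     # number of BDB blocks
    0xfe, 0xff, 0xff, 0xff,     # starting block
);

my $large-shift = $raw.read-uint16( 0, LittleEndian );   # same job as "v"
my $small-shift = $raw.read-uint16( 2, LittleEndian );
my $num-bdbs    = $raw.read-uint32( 4, LittleEndian );   # same job as "V"
my $start-block = $raw.read-uint32( 8, LittleEndian );

say 2 ** $large-shift;        # 512
say 2 ** $small-shift;        # 64
say $num-bdbs;                # 3
say $start-block.base( 16 );  # FFFFFFFE
```

The offsets are relative to the start of the Blob, so this works just as well on one big read() as on a series of small ones.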

Now that you’ve got a fairly good grounding in unpack(), I think it’s time for a break. Next time we’ll cover writing our file back out, the most fun part of the operation.


Again, many thanks to those of you that have read this far. As usual, Gentle Reader, please feel free to leave constructive questions, comments, critiques and improvements in the comment section. I do require an email address for validation, but I don’t use it for any other purpose. Thank you again, and I’ll see you in part VI of this series.

Templates II: Electric Boogaloo

Last time on this adventure writing the Template Toolkit language in Raku, we’d just created a small test suite that encompasses some of the problems we’re going to encounter. It’s no use without a grammar and a bunch of other parts, but it does give us an idea of what it’s going to look like.

use Test;
use Template::Toolkit::Grammar;
use Template::Toolkit::Actions;

# ... similar lines above this
is-deeply the-tree( 'xx[% name %]x' ),
    [ 'a', 'a', Directive.new( :content( 'name' ) ), 'a', ];
# ... and similar lines below this.

The list here is what we’re going to return to render(), and I’d love to make that as simple as it can be without being too simple. Let’s focus for the moment just on one bit of the test suite here, the array I’m getting back.

[ 'a', 'a', Directive.new( :content( 'name' ) ), 'a', ];

If these elements were all strings, then all render() would have to do is join the strings together, simples!

method render( Str $text ) returns Str {
  my @terms = (); # magic to turn text into array of terms goes here
  @terms.join: '';
}

Let’s create the ‘Directive’ class and see what happens, though.

class Directive { has $.content }

my @terms = 'a', 'a', Directive.new( :content( 'name' ) ), 'a';
say @terms.join: '';
# aaDirective<94444485232315>a

Whoops, that’s not what we want. Not bad exactly, but not what we want, either. Well, not to fear. Remember that in Template Toolkit, directives will always return a string. It may be an empty string, but they’ll always return some kind of string.

As a side note, this may not always be true – some directives will even tell the renderer to stop parsing entirely. But it’s a pretty solid starting assumption. For instance, we could say that encountering the STOP directive just makes all future directives return the empty string ‘’.

Of course, I’m harping on the term ‘string’ for a reason. Internally, everything is an object, and every object has a method that returns a readable value. Our Directive class didn’t specify one, so we get the default that returns ‘$name<$address>’.

So, let’s supply our own method.

class Directive { has $.content; method Str { $.content } }

my @terms = 'a', 'a', Directive.new( :content( 'name' ) ), 'a';
say @terms.join: ', ';
# a, a, name, a

There. If we supply a .Str method we can make Directives do what we want. INCLUDE directives would open the file, slurp the contents and return them. Argument directives would take their argument name, look up the value, and return that. Or, more likely, would have a context object passed that does the lookup for them.
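Here’s a rough sketch of that last idea, with the lookup table passed in rather than baked into .Str. The render method and the render-terms helper are names I’ve invented for illustration; the real module may well end up looking different.

```raku
class Directive {
    has $.content;

    # Hypothetical: resolve this directive's name against a hash of values,
    # falling back to an empty string when the name is unknown.
    method render( %context --> Str ) {
        ( %context{ $.content } // '' ).Str;
    }
}

# Plain strings pass straight through; Directives get asked to render.
sub render-terms( @terms, %context --> Str ) {
    @terms.map( { $_ ~~ Directive ?? .render( %context ) !! .Str } ).join;
}

my @terms = '<h1>Hello, ', Directive.new( :content( 'name' ) ), '!</h1>';
say render-terms( @terms, %( name => 'Jeff' ) );
```

The nice part of this shape is that the array of terms stays reusable: the same parsed template can be rendered again with a different hash.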

Where do we go from here?

Next time we’ll convince Grammars and Actions to work together, making processing a template as simple as:

parse-template( $text ).join( '' );

Next in this series on writing your own template language using Raku, you should be able to define your own Template Toolkit directives and have them return the pre-processed text. We’ll add support for context and the ability to do simple ‘[% name %]’ tags, and maybe explore how to change ‘[%’..’%]’ tags on-the-fly.

Thank you again, dear reader, for your interest, comments and critiques.

A Regex amuse-bouche

Before continuing with the Template series, I thought I’d talk briefly about an interesting (well, at least to me) solution to a little problem. System and user libraries (the kind that end in .so or .a, not Perl libraries) have a section at the top that maps a function name (‘load_user’ or whatever) to an offset into the library, say, 0x193a.

This arrangement worked fine for many years for C, Algol, FORTRAN and most other languages out there. But then along came languages that upset the apple cart, like C++ and Smalltalk, where a programmer could write two ‘load_user’ functions, call ‘load_user(1234)’ or ‘load_user(“Smith, John”)’ and expect the linker to load the right version of ‘load_user.’

The problem here is that the library, the linker and all of the other programs in the tool chain expect there to only be one function called ‘load_user’ in any given library.

Those of us that do Perl 5 and Raku programming don’t have to worry about this, but if you ever want to link to a C++ library, you probably should know at least a bit about “name mangling.”

For a while, utilities like ‘CFront’ for the Macintosh (which the author actually filed bug reports on) were used to “rename” functions like ‘load_user(int)’ and ‘load_user(char*)’ to ‘i_load_user’ and ‘cs_load_user’ before being added to the library, and other tools to do the reverse.

Has Your Mother Sold Her Mangle?

Eventually things settled down, and this process of changing names to fit into the library was “baked in” to the tool chains. Not consistently, of course, couldn’t have that. But conventions arose and even today Wikipedia lists at least 12 different ways to “mangle” ‘void h(void)’ into the existing library formats.

We’ll just look at the first one, ‘_Z1hv’. The ‘_Z’ can be safely ignored, its purpose there is mainly to tell the linker something “special” is going on. ‘1h’ is the function name, and ‘v’ is its first (and only) parameter. Suppose, then, that you were tasked with writing a tool that undid this name mangling.

Your first cut at extracting something useful might look something like

'_Z9load_useri' ~~ m{ ^ '_Z' \d+ (\w+) (.) $ };

Given ‘_Z9load_useri’ (the mangled version of ‘void load_user(int)’), the regex engine goes through a bunch of simple steps.

  • Read and ignore ‘_Z’
  • Read and ignore ‘9’
  • Capture ‘load_user’ into $0
  • Capture ‘i’ into $1
  • There is no fifth thing.

But the person that wrote this library is playing silly buggers with someone (obviously us in this case) and there’s also a ‘_Z9load_userss’ which comes out of the other end of the mangle looking like ‘void load_user(char*, char*)’, loading a user with first and last names.

Now we’re in a bit of a quandary. Run the same expression and see what happens:

'_Z9load_userss' ~~ m{ ^ '_Z' \d+ (\w+) (.) $ };

Sure enough, $1 is ‘s’, just as we wanted it, but what about $0? It’s now ‘load_users’, which… y’know, looks too legit to quit. But we must. And now we’re faced with the quandary. Do we make the first parameter an optional capture? ‘m{ … (.)? (.) $ }’ like so?

No, that would capture the ‘r’ of ‘_Z9load_users’. There must be something else in the name that we’re overlooking, some clue… Aha! ‘load_user’ has 9 characters, and look just before it, we’ve got the number 9! Surely that tells us the number of characters in the function name! (and thankfully it actually does.)

Regexes 201

Now, how can we use this to our advantage? First things first, let’s get rid of some dead weight. We don’t care (for the moment) about parameters, so let’s just match the name and number of characters. And because we’re getting all serious up in here, let’s create a quick test.

use Test;
'_Z9load_user' ~~ m{ ^ '_Z' (\d+) (\w+) };
is $0, '9';
is $1, 'load_user';

Run the test script, see if it passes, I’m sure you know the drill. Go ahead and copy that, I’ll wait. Okay, the tests pass, so it’s time to play. I usually am working in a library that’s in git, so I’m usually on the “edit, run tests, git reset, edit…” treadmill by this point.

So… How do we make use of this number? Well, let’s pull up the Regexes page over at docs.raku.org and look around. Back in Perl 5 there used to be this feature ‘m{ a{5} }x’ that matched just 5 copies of whatever it was in front of, that might be a good place to start looking.

That’s now morphed into ‘m{ a ** 5 }’. Great, so let’s replace 5 with $0 and go for it.

'_Z9load_user' ~~ m{ ^ '_Z' (\d+) (\w ** $0) };

“Quantifier quantifies nothing…” That’s weird. $0 is right there, staring me in the face. Maybe I just got the syntax wrong somehow?

'_Z9load_user' ~~ m{ ^ '_Z' (\d+) (\w ** 9) };

Nope, that works. What’s going on here? $0 is defined… Wait, it’s a variable inside a regex, that used to require the ‘e’ modifier, didn’t it? Or something like that… <read the manpage, scratch head… nothing there> Hm. Are we at a dead end?

Kick it up a notch

No, we just need to remember about how string interpolation works. In Raku, “Hello, {$name}!” is a perfectly fine way to interpolate variables into your expression, and it works because no matter where it is, {} signals a code block. Let’s try that, surround $0 with braces.

'_Z9load_user' ~~ m{ ^ '_Z' (\d+) (\w ** {$0}) };

Weird. This time the test failed with ‘’ (an empty string) instead of ‘load_user’. Maybe $0 really isn’t defined? Now that it’s just regular Raku code, let’s check.

'_Z9load_user' ~~ m{ ^ '_Z' (\d+) (\w ** {warn "Got '$0'"; $0}) };

“Use of Nil in string context.” So it’s really empty. Now, we have to really do some reading. Looking at the section on general quantifiers says “only basic literal syntax for the right-hand side of the quantifier [what we want to play with] is supported,” so it looks like we’re at a dead end.

But things like ‘{$0}’ do work, so we can use variables. That means that my problem isn’t that the variable is being ignored, it’s just not being populated when I need it. Let’s look at the section on Capture numbers to see when they get populated.

Aha, you need to “publish” the capture using ‘{}’ right after it. Let’s see if that works…

'_Z9load_user' ~~ m{ ^ '_Z' (\d+) {} (\w ** {warn "Got '$0'"; $0}) };

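Nope, something else is going on. As far as I can tell, the catch is that the second pair of parentheses starts a fresh capture group with its own match variables, so the ‘$0’ inside it isn’t the outer ‘$0’ we just captured. And the next block down tells us the final solution – ‘:my’. This lets us create a variable inside the scope of the regular expression and use it as well, so let’s do just that.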

'_Z9load_user' ~~ m{ ^ '_Z'
                     :my $length;          # Put $length in the proper scope
                     (\d+) {$length = +$0} # Capture the length
                     (\w ** {$length})     # And extract that many chars.
                   };

And reformat things just a wee bit so we’ve got some room to work with. Now the test actually runs, and reads only as many characters of the function name as needs be.

And just one more thing…

It’s not just function names that follow this pattern, it’s also namespaces, and any special types that the function might use as parameters, so let’s package this up into something more useful.

my regex pascalish-string {
  :my $length;
  (\d+) {$length = +$0}
  (\w ** {$length})
};
'_Z9load_user' ~~ m{ ^ '_Z' <pascalish-string> };
is $/<pascalish-string>[0], 9;
is $/<pascalish-string>[1], 'load_user';

Pascal implementations were done back when RAM was at more of a premium, and stored a string like ‘load_user’ as ‘\x{09}load_user’ so the compiler knew how many bytes were available immediately rather than having to guess. It was limiting, but this was on computers like the early Macs (we’re talking pre-OS X, for that matter pre-System 7, for those of you that remember that far back.)

So we can use this <pascalish-string> regular expression anywhere we want to match one of our counted terms. Because we’re using ‘my’ inside a regular expression nested inside another regular expression inside a burrito wrapped in an enigma, there are no scoping troubles.
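To close the loop on the ‘_Z9load_userss’ problem from earlier, here’s a small sketch that uses the counted name and then grabs whatever is left over as the parameter codes. The trailing ‘(.*)’ capture is my own addition for illustration; real mangled names have more going on than this.

```raku
my regex pascalish-string {
    :my $length;
    (\d+) { $length = +$0 }   # how many characters the name has
    (\w ** {$length})         # exactly that many, no more
}

if '_Z9load_userss' ~~ m{ ^ '_Z' <pascalish-string> (.*) $ } {
    say ~$<pascalish-string>[1];   # the function name
    say ~$0;                       # the leftover parameter codes
}
```

If everything behaves the way the tests above suggest, the name should come out as ‘load_user’ and the leftovers as ‘ss’, which is what the greedy ‘\w+’ version couldn’t manage.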

There are probably other ways of doing this, and I would love to see them. If you do come up with a better way to solve this, let me know in the comments and I’ll work your solution into an upcoming article.

As usual, gentle reader, thank you for your time and attention, and if you have any comments, questions, clarifications or criticisms (constructive, please) let me know.

Templates and a Clean Start

Before I get into the meat of the topic, which will eventually lead to a self-modifying grammar (yes, you heard me, self-modifying…) I have a confession to make, in that a series of articles on the old site may have led people astray. I wrote that series thinking to make parsing things where no grammar existed easier.

It may have backfired. So, as a penance, I’m simultaneously pointing theperlfisher.{com,net} to this new site, and starting a new series of articles on Raku programming with a different approach. This time I’ll be incorporating more of my thoughts and what hopefully will be a different approach.

Begin as you mean to go on.

I would love to dump the CMS I’m currently using for something written in Raku. Among the many challenges that presents is displaying HTML, and to paraphrase Clint Eastwood, I do know my limitations. So, I don’t want to write HTML. Ideally, not ever.

So, that means stealing, er, borrowing HTML from other sites and making it my own. Since those are usually Perl 5 sites, that means dealing with Template Toolkit. And already I can hear some of you screaming “Raku already handles everything TT used to! Just use interpolated here-docs!”

And, for the most part, you’re absolutely correct. Instead of the clunky ‘[% variable_name %]’ notation you can use clean inline interpolation with ‘{$variable-name}’, and being able to insert blocks of code inline means you don’t have to go through many of the hoops that you’re required to jump through with Template Toolkit.

That’s all absolutely true, and I hope to be able to use all of those features and more in the final CMS, whatever that happens to be. This approach ignores the fact that most HTML out there is written with Template Toolkit, and that rewriting HTML, even if it’s just a few tiny tags, is an investment of time that could be better done elsewhere.

If only there were Template Toolkit for Raku…

Let’s dive in!

If you’re not familiar with Template Toolkit, it’s a fairly lightweight programming language for writing HTML templates, among others. Please don’t confuse it with a markup language, designed to be rendered into HTML. This is a language that lets you combine your own code with a template and generate dynamic displays.

<h1>Hello, [% name %]!</h1>

That is a simple bit of Template Toolkit. Doesn’t look like much, does it? It’s obviously a fragment of a proper HTML document because there’s no ‘<html>’..‘</html>’ bracketing it, and obviously whatever’s between ‘[%’ and ‘%]’ is being treated specially. In this case, it’s being rendered by an engine that fills in the name, maybe something like…

$tt.render( 'hello.tt', :name( 'Jeff' ) );

where hello.tt is the name of the template file containing the previous code, and ‘Jeff’ is the name we want to substitute. We’ve got a lot of work to go through before we can get there, though. If you’ve read previous articles of mine on the subject, please try to ignore what I’ve said there.

Off the Deep End

First things first, we need a package to work in. For this, I generally rely on App::Mi6 to do the hard work for me. Start by installing the package with zef, and then we’ll get down to business. (It should be installed by default, if you’re still using rakudobrew please don’t.)

$ zef install App::Mi6
{a bit of noise}
$ mi6 new Template::Toolkit
Successfully created Template-Toolkit
$ cd Template-Toolkit

Ultimately, we want this test (in t/01-basic.t – go ahead and add it) to pass:

use Test;
use Template::Toolkit;
my $tt = Template::Toolkit.new;
is $tt.render( 'hello.tt', :name( 'Jeff' ) ), '<h1>Hello, Jeff!</h1>';

It’ll fail (and miserably, at that) but at least it’ll give us a goal. Also it should give us an idea of how others will use our API. Let’s think about that for a few moments, just to make sure we’re not painting ourselves into any obvious corners.

In order to be useful, our module has to parse Perl 5 Template Toolkit files, and process them in a way that’s useful in Raku. Certain things will go by the wayside, to be sure, but the core will be a module that lets us load, maybe compile, and fill in a template.

Hrm, I just said ‘fill in’ rather than ‘render’, which is what I said above. Should I change the method name? No, not really, the new module will still do what the Perl 5 code used to, it just won’t do it using Perl 5, so some of the old conventions won’t work. Let’s leave that decision for now, and go on.

Retrograde is all the rage

Let’s apply some basic retrograde logic to what we’ve got here, given what we know of Raku tools. In order to get the string ‘<h1>Hello, Jeff!</h1>’ from ‘<h1>Hello, [% name %]!</h1>’, we need a lot of mechanics at work.

At first glance, it seems pretty obvious that ‘[% name %]’ is a substitution marker, so let’s just do a quick regexp like this:

$text ~~ s:g{ '[%' \s* (\w+) \s* '%]' } = %args{$0};

That should replace every marker in the text with something from an %args hash that render() supplies to us. End of column, end of story. But not so fast, if all Template Toolkit supplied to us was the ability to substitute values for keys, then … there’s really no need for the module. And in fact, if you look at the docs, it can do many more things for us.
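For the record, here’s the quick-and-dirty substitution as something you can actually run. It’s a sketch of the naive approach only, written with .subst instead of the s{} operator, and %args and $text are filled in by hand here rather than supplied by a render() method.

```raku
my %args = name => 'Jeff';
my $text = '<h1>Hello, [% name %]!</h1>';

# \s* on either side of the name, because whitespace inside a Raku regex
# is ignored rather than matched literally.
my $out = $text.subst(
    / '[%' \s* (\w+) \s* '%]' /,
    -> $m { %args{ ~$m[0] } // '' },   # unknown names become empty strings
    :g,
);

say $out;   # <h1>Hello, Jeff!</h1>
```

It’s fine for ‘[% name %]’ and nothing else, which is exactly why the rest of this series exists.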

For example, ‘[% INCLUDE %]’ lets us include other template files in our own, ‘[% IF %]’ .. ‘[% END %]’ lets us do things conditionally, and a whole host of other “directives” are available. But you’ll see here the one thing they have in common is they all start with ‘[%’ and end with ‘%]’.

Hold the phone

That isn’t entirely true, and in fact there’s going to be another article in the series about that. But it’s a good starting point. We may not know much about what the language itself looks like, but I can tell you that tags are balanced, not nested, and every ‘[%’ opening tag has a ‘%]’ tag that closes it.

I’ll also point out that directives ( ‘[% foo %]’ ) can occur one after another without any intervening white space, and may not occur at all. So already some special cases are starting to creep in.

In fact, let’s put this in as a separate test file entirely. So separate that we’re going to put it in a nested directory, in fact. Let’s open t/parser/01-basic.t and add this set of tests:

use Test;
use Template::Toolkit::Parser;

my $p = Template::Toolkit::Parser.new;

0000, AAAA
0001, AAAB
0010, AABA
0011, AABB
0100, ABAA
0101, ABAB
... # and so on up to
1110, BBBA
1111, BBBB

Now just HOLD THE PHONE here… we’re testing directives for Template Toolkit, not binary numbers, and whatever that other column is! Well, that’s true. We want to test text and directives, and make sure that we can get back text when we want it, and directives when we want them.

At first blush you might think it’s just enough to make sure that ‘<h1> Hello,’ is parsed as text, and that ‘[% name %]’ is parsed as a directive, and just leave it at that. But those of you that have worked with regular expressions for a while might wonder how ‘[% name %][% other %]’ gets parsed… does it end at the first ‘%]’, or continue on to the next one?

And what about text mixed with directives? Leading? Trailing text? Wow, a lot of combinations. In fact, if you wanted to be thorough, it wouldn’t hurt to cover all possible combinations of text and directives up to… say, 4 in a row.

Let’s call text ‘T’, and directives ‘D’. I’ve got 4 slots, and only two choices for each. Filling the first slot gives me ‘T_ _ _’ and ‘D_ _ _’, for two choices. I can fill the next slot with ‘T T _ _’, ‘T D _ _’, ‘D T _ _’, and ‘D D _ _’, and I think you can see where we’re going with this.

In fact, replace T with 0 and D with 1, and you’ve got the binary numbers from 0000 to 1111. So, let’s take advantage of this fact, and do some clever editing in our editor of choice:

0010, AABA                            =>
is-deeply the-tree( '0010, AABA       =>
is-deeply the-tree( '0010' ), [ AABA  =>
is-deeply the-tree( '0010' ), [ AABA ];

A few quick search-and-replace commands should get you from the first line to the last line. Now it’s looking more like a Raku test, right? We’re not quite there yet, ‘0010’ still doesn’t look like a string of text and directives, and what’s this AABA thing? One more search-and-replace pass, this time global, should solve that.

is-deeply the-tree( '0010' ), [ AABA ]; =>
is-deeply the-tree( 'xx1x' ), [ AABA ]; =>
is-deeply the-tree( 'xx[% name %]x' ), [ AABA ]; =>
is-deeply the-tree( 'xx[% name %]x' ), [ 'a', 'a', B'a', ]; =>
is-deeply the-tree( 'xx[% name %]x' ),
          [ 'a', 'a', B'a', ]; =>
is-deeply the-tree( 'xx[% name %]x' ),
    [ 'a', 'a', Directive.new( :content( 'name' ) ), 'a', ];

Starting out with the padded binary numbers covers every combination of text and directive possible (at least 4 long). A clever bit of search-and-replace in your favorite editor gives us a working set of test cases that check a set of “real-world” strings, and a file you can almost run. Next time we’ll fill in the details, and get from zero to a minimal (albeit working) Template Toolkit implementation.
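For anyone who’d rather generate the sixteen inputs than massage them in an editor, here’s a minimal sketch of the same idea. The template-for helper is a name I made up; it only builds the input strings (0 becomes text, 1 becomes a directive), not the expected arrays.

```raku
# Hypothetical helper: turn a number from 0 to 15 into a four-slot template,
# where a 0 bit is a character of text and a 1 bit is a directive.
sub template-for( Int $n --> Str ) {
    $n.fmt( '%04b' ).comb.map( { $_ eq '1' ?? '[% name %]' !! 'x' } ).join;
}

for ^16 -> $n {
    my $bits = $n.fmt( '%04b' );
    say "$bits => { template-for( $n ) }";
}
```

The line for 0010 comes out as ‘xx[% name %]x’, the same string the search-and-replace steps arrived at.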

As always, dear reader, feel free to post whatever comments, questions, and/or suggestions that you may have, including ideas for future articles. I read and respond to every comment, and thank you for your time.