Rewriting Perl code for Raku Part V

Last week we started to talk about the pack() and unpack() builtins for Raku and Perl. These aren’t terribly common built-ins to use, so I thought I’d take some time to go over these in detail and talk about how I use them and debug files that use them.

As a gentle reminder, OLE::Storage_Lite is a Perl module to read and write a subset of the Microsoft OLE storage format. As part of my effort at the start, I’ve got a “translation” of the original Perl code pounded out, without much thought to whether it’ll work, or really even compile. It looks like the Perl version, but with most of the {} changed to <> and each -> changed to a single dot.

What to test first… The reading side seems to be the easiest, because I can check object-by-object to see what the data should look like. Replicating that for Raku becomes essentially fixing the bugs I know I’ve introduced on the way.

Testing testing… is this on?

Before we dive into the Raku code, though, let’s just set up a quick test in Perl. There really wasn’t one to begin with, which is a testament to how well-used the module is. I’ve got a ‘test.xls’ file that I’ve already checked in LibreOffice to make sure it works, so I’ll add a test script that reads the file and checks the root object.

use Test::More;
use OLE::Storage_Lite;

my $root = OLE::Storage_Lite->new( 'sample/test.xls' );
use YAML; die Dump($root);
isa_ok $root, 'OLE::Storage_Lite::PPS::Root';
is $root->No, 0;
is $root->PrevPps, 0xfffffffe;

You might be reading the code and wondering what the heck die() is doing in a test suite. In my current copy it’s commented out, but it’s a quick and dirty way to dump the data so I can compare it against the Raku version of the file, which looks almost the same.

use Test;
use OLE::Storage_Lite;

my $root = OLE::Storage_Lite.new( 'sample/test.xls' );
die $root.perl;
isa-ok $root, 'OLE::Storage_Lite::PPS::Root';
is $root.No, 0;
is $root.PrevPps, 0xfffffffe;

Notice there’s hardly any difference overall, just a few minor syntax tweaks. And I don’t need to use YAML. I’ve also got a Q&D way to run my code, and my screen layout keeps most of what I need in my face.

This is all a rather plain tmux setup, running multiple panes so I can see what’s going on. On the left is vim running in split-screen mode with the Perl and Raku test files open. The rest are shells in the Perl and Raku directories, and some commands to get byte dumps of the files.

I’ve also set up the following aliases in my shells:

alias 5="perl -Ilib"
alias 5p="prove -Ilib"
alias 6="perl6 -Ilib"
alias 6p="prove -e'perl6 -Ilib'"

This way I can run both Perl and Raku test suites with just a few keystrokes, and not have to worry about details such as -I paths. You’re of course welcome to do things exactly the same, completely different, or even radically better than I am, in which case please let me know.

You might notice the use of the language’s old name here. I haven’t changed over to the new binaries yet, but the techniques I’ll talk about here won’t change.

Keeping it Clean

We now have two scripts that should produce the same output, but probably won’t, for any number of reasons. I’ve got a whole article’s worth of things that I had to do to make the new module compile, let alone run. But that’s for a later issue.

Let’s start out with this section, which might be familiar to longtime (ha!) readers.

  $rhInfo->{_FILEH_}->seek(0, 0);
  $rhInfo->{_FILEH_}->read($sWk, 8);
  return undef unless($sWk eq "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1");

This is in Perl, of course. In Raku I’ve chosen to write

  $file.seek( 0, SeekFromBeginning );
  my Str $sWk = $file.read( 8 ).unpack('A8');
  die "Header ID incorrect" if $sWk ne HEADER-ID;

It’s a bit ungraceful to die() inside a module, but this guarantees that execution stops way before it can cause a hard-to-debug problem down the road. The first change is that I’ve refactored $rhInfo->{_FILEH_} out into its own $file variable so I don’t have to repeat references to $rhInfo all over the place, like the original.

Next is using the built-in IO::Handle constant ‘SeekFromBeginning’ instead of the rather anodyne 0 as in Perl. Probably the parent OLE::Storage module looked ahead in the file to determine something before reading in earnest. I’m keeping it here for no good reason other than it might be nice to separate ‘read’ functionality into a different method.

Diving in

The next line will cause some consternation, so I’ll unpack it slowly. The original author used Hungarian notation for their variable names, so the ‘s’ of $sWk means that it’s a string type. I’ve adopted this for the Raku code as well, actually enforcing the variable type without additional code.

File handles have both a fancy lines() method that lets you read files line-by-line, and a raw read() method that lets you read raw bytes. If I stopped right here and just looked at the raw bytes, the code would actually fail, and I’ve talked about why in earlier parts. Suffice to say that read() returns a buffer of uninterpreted bytes that you have to decode later, not a string.

Decoding here is the job of the unpack() statement. It acts just like its Perl counterpart, but is experimental. Lucky for me, it implements enough of the Perl builtin that I can use it to read the entire OLE file.

Now, unlike other builtins (again, keeping in mind it’s experimental), it’s only available as a method call. There is a version of unpack() that works on multiple arguments, but if you try to call it as a builtin, expect:

===SORRY!=== Error while compiling -e
Undeclared routine:
    unpack used at line 1. Did you mean 'pack'?

This may be fixed in your version, feel free to try it and let me know if I should upgrade 🙂 In any case, the last bit you’re wondering about is the ‘A8’ business as its argument. I think this isn’t explained correctly in the documentation, so I’ll explain in my own way.

read() returns a raw buffer of bytes, without interpretation. If it sees hex 0x41, it doesn’t “know” if you meant the ASCII character ‘A’ or the number 65, so it doesn’t interpret the data, it just puts the data into the buffer. It relies on the Buf(fer)’s pack() and unpack() methods to assign types to the data.

So finally, unpack( "A8" ) pulls out 8 “ASCII” characters and puts them into $sWk. Now I used scare-quotes there because ASCII is a 7-bit encoding, not 8 bits as many people seem to think. It only encodes from 0x00-0x7f, so anything over that isn’t legal ASCII.

Which just means that the “A” of “A8” doesn’t truly correspond to ASCII, but it’s close enough. So, we call unpack( "A8" ) on the buffer that $file.read( 8 ) returns, and get back a string that we can finally check against our header.
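For comparison, here’s a minimal Perl sketch of the same signature check, run against an in-memory file so it doesn’t need a real spreadsheet. The has_ole_header() helper and the sample data are my own inventions for illustration, not part of OLE::Storage_Lite. I’ve used “a8” rather than “A8” because Perl’s “A” also strips trailing spaces and NULs on unpack, which doesn’t matter for this particular header but could bite with others.

```perl
use strict;
use warnings;

# The 8-byte OLE signature, as it appears in the original Perl source.
my $HEADER_ID = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1";

# Hypothetical helper: true if the handle starts with the OLE signature.
sub has_ole_header {
    my ($fh) = @_;
    binmode $fh;
    seek $fh, 0, 0 or return 0;
    my $got = read $fh, my $buf, 8;
    return 0 unless defined $got && $got == 8;
    # "a8" returns the 8 bytes untouched.
    my ($magic) = unpack 'a8', $buf;
    return $magic eq $HEADER_ID ? 1 : 0;
}

# Made-up data: the signature followed by 8 zero bytes.
my $data = $HEADER_ID . ("\0" x 8);
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
print has_ole_header($fh) ? "looks like OLE\n" : "not OLE\n";
```

Returning a flag instead of calling die() leaves it up to the caller to decide how loudly to complain.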


But what if the header isn’t what we expect? Your first instinct might be to say you must’ve screwed up and sent it the wrong file. Luckily it’s pretty easy to check that: just call $file.slurp.print; and that’ll show you the contents quickly. If it’s text you’ve probably got the wrong file. OLE files do contain text, but it’s usually zipped or in UCS-2.

Let’s assume though that it’s an actual binary file, and a real spreadsheet that Excel (or LibreOffice in my case) can read. Since the headers don’t match, it must be a different version of OLE that our code isn’t ready to handle.

That means we need to know what the first 8 bytes of the file actually are. We’ve got a bunch of tools at our disposal, but what I want to introduce is hexdump(1) (don’t worry about the (1), force of habit.) Run this command on the file:

hexdump -C sample-file.xls | head -1

This should generate something like this:

00000000  d0 c9 11 a0 af b1 13 d1  00 00 00 00 00 00 00 00  |................|

(original bytes changed to protect the innocent file) The numbers on the left (‘00000000’) tell us how far we are into the file (in hex), the next two groups of 8 are the hex values of the individual bytes of the file, and the dots between ‘|..|’ are where any printable characters would appear, if there were any.
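If hexdump(1) isn’t handy, the same row can be built in a few lines of Perl. This is only a sketch: hexdump_row() is a name I made up, and it imitates the layout of a single `hexdump -C` row rather than reproducing the whole tool.

```perl
use strict;
use warnings;

# Hypothetical helper: format up to 16 bytes like one row of `hexdump -C`.
sub hexdump_row {
    my ($offset, $bytes) = @_;
    # One two-digit hex string per byte, padded out to 16 columns.
    my @hex = map { sprintf '%02x', $_ } unpack 'C*', $bytes;
    push @hex, '  ' while @hex < 16;
    # Anything outside printable ASCII is shown as a dot.
    (my $text = $bytes) =~ tr/\x20-\x7e/./c;
    return sprintf "%08x  %s  %s  |%s|",
        $offset, join(' ', @hex[0..7]), join(' ', @hex[8..15]), $text;
}

# Made-up data: the usual OLE signature followed by 8 zero bytes.
my $first = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1" . ("\0" x 8);
print hexdump_row(0, $first), "\n";
```

The “C*” format is unpack() again, this time pulling every byte out as a small unsigned integer.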

So now we know what the first 8 bytes of this file look like, and we can add (without much muss or fuss) some checks to our original file, and come up with this:

$file.seek( 0, SeekFromBeginning );
my Str $sWk = $file.read( 8 ).unpack('A8');
die "Unknown OLE header!" if $sWk eq "\xd0\xc9\x11\xa0\xaf\xb1\x13\xd1";
die "Header ID incorrect" if $sWk ne HEADER-ID;

This check isn’t in my source, so don’t go looking for it. As far as I know there aren’t any other OLE header strings than what I check for, but then I’m trying to get away without reading the spec. My blood pressure doesn’t need that.

Getting at the details

Of course, binary packed formats contain more stuff than just ASCII strings. OLE was originally written in the days of 16-bit CPUs, so it’s got other ways to pack in data. Let’s look at a fragment of the file format: (not from the spec, this is just my interpretation)

0000: 0xD0 0xCF 0x11 0xE0 0xA1 0xB1 0x1A 0xE1 # header
0008: 0x00 0x09           # size of large block of data (in power-of-2)
000a: 0x00 0x06           # size of small block of data
000c: 0x00 0x00 0x00 0x03 # Number of BDB blocks
0010: 0xff 0xff 0xff 0xfe # Starting block

So, this is the first 20 (0x0010+4) bytes of an OLE header block. You may have already caught on to the fact that there are at least 3 sizes of data here. The first 8 bytes on line 0000 are the header data we talked about ad nauseam.

Next, the header says that a “large” block of data is 2**9 bytes long and a “small” block of data is 2**6 bytes long, each stored as a pair of bytes. Finally we’ve got the number of BDB blocks (whatever those are, probably Berkeley DB) and the starting block’s index number, each in a 4-byte chunk.

This means we need to read 2 2-byte chunks and 2 4-byte chunks into memory. This time though, we have to read them as numbers. Once again, unpack() comes to the rescue. Last time we used the ‘A’ character, this time we’ll do something just a little bit different.

Let’s read the documentation for unpack() to see what we can use. Halfway down the page we come to a table which gives us the letter abbreviations for each type of data we can read, and what it is in terms of where it is in memory.

For now, replace the term ‘element’ with ‘byte’ while you’re reading the documentation. We need to read (0x00, 0x09) as a 2-byte integer, so let’s look for “two elements” on the right-hand side. “Extracts two elements and returns them as a single unsigned integer” seems to be what we need.

So it looks like the letter we need to use is “S”, and since we only want to read one at a time, that’s all we need. But the original Perl source uses “v”, so that’s what I’ll use as well.
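One detail worth checking before trusting any of these letters is byte order. In Perl, “v” always reads a 16-bit unsigned integer low byte first (little-endian), “n” reads it high byte first, and “V” is the 32-bit little-endian counterpart of “v”. The sketch below uses made-up two-byte and four-byte strings rather than a real file, and assumes the value 9 is stored on disk as 0x09 0x00.

```perl
use strict;
use warnings;

# Made-up sample: the value 9 stored low byte first.
my $on_disk = "\x09\x00";

my ($little) = unpack 'v', $on_disk;   # 'v': 16-bit unsigned, little-endian
my ($big)    = unpack 'n', $on_disk;   # 'n': 16-bit unsigned, big-endian
printf "v => %d, n => %d\n", $little, $big;

# 'V' is the 32-bit little-endian equivalent.
my ($long) = unpack 'V', "\x03\x00\x00\x00";
printf "V => %d, block size => %d\n", $long, 2 ** $little;
```

Reading the same two bytes with the wrong letter gives 2304 instead of 9, which is the kind of bug that makes a block size look absurd. Back in the original Perl source, the read looks like this: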

  $iWk = _getInfoFromFile($rhInfo->{_FILEH_}, 0x1E, 2, "v");
  return undef unless(defined($iWk));
  $rhInfo->{_BIG_BLOCK_SIZE} = 2 ** $iWk;

But as you can see, the Perl source creates a wrapper around the unpack() call, much to my annoyance. I’d prefer to simply write this:

$iWk = $file.read( 2 ).unpack( "v" );
%hInfo<_BIG_BLOCK_SIZE> = 2**$iWk;

but to keep things looking as similar to the original Perl code as I can, my code looks like

  my Int $iWk = self._getInfoFromFile( $file, 0x1E, 2, "v" );
  die "Big block size missing" unless defined( $iWk );
  %hInfo<_BIG_BLOCK_SIZE> = 2 ** $iWk;

which is just one line longer, and that’s because of the safety check. Of course, pack() and unpack() can take more than one format character at a time. In Perl, there’s yet another mini-language (like regex, and what used to be called the format statement) for these builtins, and Raku’s experimental version of it isn’t quite done yet.

But you can still take the entire header we’ve collected so far, and write it into a single unpack() statement like so:

my ( $header, $large-size, $small-size, $num-bdbs, $start-block ) =
  $file.read( 20 ).unpack( "A8 vv VV" );
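The Perl equivalent is a round trip you can try without a file: pack() the five fields into 20 bytes, then unpack() them with the same format. This sketch follows the simplified layout from my fragment above, not the real OLE header layout, and the variable names are mine.

```perl
use strict;
use warnings;

# Same field order as the fragment: signature, two 16-bit sizes, two 32-bit numbers.
my $format = 'a8 v v V V';

# Build a made-up 20-byte header.
my $header = pack $format,
    "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 9, 6, 3, 0xFFFFFFFE;

# One unpack() call pulls out every field at once.
my ($magic, $large_shift, $small_shift, $num_bdbs, $start_block)
    = unpack $format, $header;

printf "header is %d bytes\n", length $header;
printf "large block: %d bytes, small block: %d bytes\n",
    2 ** $large_shift, 2 ** $small_shift;
printf "BDB blocks: %d, start block: 0x%08x\n", $num_bdbs, $start_block;
```

Using one format string for both directions is a cheap sanity check that the string says what you think it says.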

This format is of course much more compact and much easier to read. In all probability once I get done with the main module I’ll convert everything over to this style and the code will become much, much quieter. Binary protocols, especially those for moisture evaporators, tend to have lots of code that looks like:

my $rev = $file.read(2).unpack("v");
if $rev == 0x01 {
  $r2 = $file.read(2).unpack("v");
} else {
  $d2 = $file.read(4).unpack("V");
}

where the next bytes you read depend upon the version of the protocol. Even though I’ve just been rattling off code based on the Perl version, I don’t know what the protocol may do at any given point. So it makes sense to read just one int or long ahead while developing.

I could read a version number as “V” because they started out using “v1”, “v2” and so on up to “v42792643522”. But then 30 lines and 2 revs later they may have changed from “V” to “vcc” because they wanted to support “v2.1.0” style.

And if that header were something like “A8 V CC* V vv” I have to go back and break up the format string and statement at the very least. If I go term-by-term I just have to find the version number and add an if-then statement just below.
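Here’s how that term-by-term style might look in Perl. Everything in this sketch is hypothetical: the record layout (a 16-bit revision followed by either a 16-bit or a 32-bit value), the read_field() and parse_record() helpers, and the sample data are all invented to show the shape of the code.

```perl
use strict;
use warnings;

# Hypothetical helper: read exactly $len bytes and unpack them with $fmt.
sub read_field {
    my ($fh, $len, $fmt) = @_;
    my $got = read $fh, my $buf, $len;
    die "Short read: wanted $len bytes\n" unless defined $got && $got == $len;
    return unpack $fmt, $buf;
}

# Made-up format: revision 1 carries a 16-bit value, later revisions a 32-bit one.
sub parse_record {
    my ($fh) = @_;
    my $rev   = read_field($fh, 2, 'v');
    my $value = $rev == 0x01
        ? read_field($fh, 2, 'v')
        : read_field($fh, 4, 'V');
    return ($rev, $value);
}

# Two made-up records, one per revision.
my $old = pack 'v v', 1, 513;
my $new = pack 'v V', 2, 70000;

for my $bytes ($old, $new) {
    open my $fh, '<', \$bytes or die "Cannot open in-memory file: $!";
    my ($rev, $value) = parse_record($fh);
    print "rev $rev: value $value\n";
}
```

When a third revision shows up, the change is one more branch in parse_record() rather than a rewrite of a long format string.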

Now that you’ve got a fairly good grounding in unpack(), I think it’s time for a break. Next time we’ll cover writing our file back out, the most fun part of the operation.

Again, many thanks to those of you that have read this far. As usual, Gentle Reader, please feel free to leave constructive questions, comments, critiques and improvements in the comment section. I do require an email address for validation, but I don’t use it for any other purpose. Thank you again, and I’ll see you in part VI of this series.
