This time around we’re going to talk about how to rewrite Perl code in Raku. Even in 2019, a lot of the office world revolves around spreadsheets, whether they be Excel, LibreOffice or simple .csv files. Perl 5 has a plethora of modules to do this, a quick search for ‘Spreadsheet’ on MetaCPAN should convince you of that.
The Raku world doesn’t have quite as many modules as you’d expect, though. While it’s been around for a few years, “heavy lifting” modules like Spreadsheet stuff really haven’t come around yet. This involves packing and unpacking binary formats, and in Perl 5 this centered around the pack and unpack builtins, which are relative newcomers to Raku.
But Raku has built-in binary buffers, which take care of most of the need for pack/unpack. The main reason I can see is the OLE storage format. Basically it’s Microsoft’s way of packing a file system into a single data file. And at this point the proverbial yaks start to pile up, and reasonable people say “You know, Excel still accepts .csv files, I know how to build those.”
Enter raku-OLE-Storage_Lite – this is my translation-in-progress from Perl 5 to Raku. As of this writing it can read an entire OLE file (without data) and write a good portion of the sample file – I believe I’ve got maybe two methods left to debug.
Knee deep in yaks
CSV files are easy to write, but they come with their own set of troubles. When you import a .csv file into your Excel app (or LibreOffice, or whatever) you’re faced with a complex dialog asking you how to import your data, and the average user doesn’t want that every time, they just want to open their spreadsheet.
So, it’s time to follow Liz’s lead and rewrite in Raku an existing module. First thing I did was go to Spreadsheet::ParseExcel and see how they did things. Within a few minutes I’d already encountered the first yak. After opening the file, it delegates it to OLE::Storage_Lite, which is much like James Brown, the “hardest-working man in show business”.
It’s still on version 0.19 at the time of writing, but I assure you that’s only because the current maintainer hasn’t updated the version to reflect reality. It may be legacy Perl rough-and-tumble code, but it’s been around for a long time. It wears its battle scars proudly.
It relies heavily on pack and unpack, which are still technically experimental in Raku. The OO and coding style betrays its pre-5.00 origins, and the tests are, well, very pragmatic. “Does it load? Great! Can it convert timestamps internally? Great! Ship it!”
To its credit, there’s a sample directory where you can use smpview.pl to view the contents of the internal filesystem of any OLE file, and a sample writer to create a known-working OLE file. That’ll do as a starting point.
So, reading an Excel spreadsheet means reading an OLE file system. And when I say file system, I’m not kidding. Inside your typical .xlsx file, there’s a small header and a root object. The root object contains “pointers” (really file offsets) to a document object, and inside that are file objects, each with pointers to the different blocks.
This is all intended to reflect the original disk layout, so it looks very much like an NTFS superblock and block layout. The documentation seems to have moved to this page detailing OLE 1.0 and 2.0 formats, I’m not at all certain what the current version has.
How are Excel spreadsheets arranged in here? Worksheets are OLE directories, and inside each worksheet, tabs are individual files. How’s that for a bit of inspiration? Luckily the Root directory, Files and nested Directories are all separate objects, with at least a few common methods aggregated into a superclass.
This is a long-winded way of saying the module in question is very much legacy code. And, as I want to bring it into the proverbial light, I’ve got to give some issues some thought.
- No useful tests, so I’ll have to write those.
- How much code do I want to sacrifice?
- How much can I save?
Well, I can put off #2 and #3 while writing some tests. Whoa, wait a minute. I don’t have a test file to work with, just some scripts over in sample/. Mumble, mumble, more yaks. Read README, find that smpsv.pl will create one, run that.
Great, I’ve got a sample test.xsl file. But given the amount of potential bit-rot it seems prudent to actually make sure that I’ve got a working Excel file before committing a few days (ha!) to getting a module working. Double-click it, launch into Excel’s cloud-serviced app, find that it’s one of those Win10 panes I’ve never figured out how to close, open task-killer, kill that.
Launch LibreOffice which I happen to have lying around – my current project at work is parsing a spreadsheet in Perl 5, which is what inspired this whole workload.
Yep, that parses; looks a bit odd because it’s coming up with a Japanese font, and some arbitrary English text, but it works. Also, looking at the code it generates all three object types – Root, File and Dir, so it’ll exercise the major code paths. Bonus.
Now I’ve got the makings of a simple test file. The script builds objects individually, so I can run the individual calls, and check that the object’s internals look the way I want.
my $oDt = OLE::Storage_Lite::PPS::Root->new( [ ], [ 0, 0, 16, 4, 10, 100 ], # 2000/11/4 16:00:00:0000 [ $oWk, $oDir ] );
In Raku, this converts to:
my $oDt = OLE::Storage_Lite::PPS::Root.new( (), ( 0, 0, 16, 4, 10, 100 ), # 2000/11/4 16:00:00:0000 ( $oWk, $oDir ) );
I’ve made one change already, to make things simpler for Raku users. In Perl, you have to pass lists as references unless you want to use the new function signatures. In Raku, you can just pass lists as you would ordinarily to your method call.
Using native data types rather than passing references around may seem a bit odd at first to new Raku programmers, but the new variable classes are easier to enforce strong typing on later, when you get used to the language.
Going with the flow
Now we’ve got something we can test, namely making sure that we’ve got a valid OLE Root document. So, before we go ahead with the code, I’ll share a few little things. I know very little about this code, so I want to make sure that I intimately copy each detail of the object at this stage. Later on I might get fancy and replace things with their own object types, but for now, my goal is going to be 1:1 replication.
I tend to like tmux as a shell environment, haven’t really gotten along with UIs. So, keeping in mind that I wanted an absolute 1:1 copy of the original object, I ended up doing this:
- Switch to new window, open my copy of ‘samples/smpsv.pl’ in vim
- Add ‘use YAML; die Dump( $oDt ) just below the line where it gets created
- Switch to new window, run the sample script, copy the YAML output
- Close the two new windows I created to keep clutter down
- Paste the YAML code into the new Raku test.
my $oDt = OLE::Storage_Lite::PPS::Root.new( (), ( 0, 0, 16, 4, 10, 100 ), # 2000/11/4 16:00:00:0000 ( $oWk, $oDir ) ); -- Name: "R\0o\0o\0t\0 \0E\0n\0t\0r\0y\0" No: ~ Time2nd: - 0 - 0 - 16 - 4 - 10 - 100 # and so on...
This should contain all I need to create an OLE file from this set of objects. I’m using this as a sneaky way of not reading the spec, at least not yet. As the old title goes: Algorithm + Data Structure = Program. Using YAML (or Data::Dumper) gives me the data structure, copying the Perl 5 code into Raku gives me the algorithm.
I should almost be able to keep line-for-line fidelity, so when a patch is posted to the Perl 5 source I can import it into Raku without too much trouble. But once I’ve got a better test base and a few users in Raku I’ll probably rewrite this whole module in a more Raku-ready fashion. I can keep the old module around for reference.
But we’ve also got a surprise lurking here. “R\0o\0ot\0 \0E\0n\0t\0r\0y\0” looks like binary garbage, but is actually UCS-2, I think. If it is, then the OLE file is limited to a subset of Unicode. I can put restrictions on it later if I have to, but ATM I actually don’t care.
I’ve done enough time in the i18n salt mines that I know how to deal with this. Store the string in the best format possible (UTF-8 here) internally. When the time comes to write it to the network or disk, translate it to the final encoding.
This way I can see what all the attributes are at a glance without changing encoding. I can also manipulate everything using regular Raku code until the last moment. If I have to, I can use Raku’s gradual typing to constrain the string. More importantly, I don’t have to do any of this now.
Got any change?
This means I’m going to change things just a little bit more. When data gets added to ‘Name’ I’m going to assume it’s UTF-8. Since I’m not doing any I/O yet, I can make whatever assumptions I want. Keeping the internals simple keeps my life simple, at least.
So I’ll write out a quick is-deeply test and get on with things:
is-deeply $oDt, ( Name => 'Root Entry', Time2nd => ( 0, 0, 16, 4, 10, 100 ), # ... Child => ( $oWk, $oDir ) );
This looks pretty straightforward, and almost how you’d write the original test in Perl 5. It won’t run yet, but that’s something we’ll tackle in the next part in the series.
I’m not done quite yet, because I’ve got a lot of these things to write, and not all of them may have the ‘Child’ attribute. I could write a tiny method that skipped over the ‘Child’ attribute along with anything else I wanted, but that felt clumsy. It looked like:
ok sorta-deeply $oDt, ( Name => 'Root Window', Time2nd => ( 0, 0, 16, 4, 10, 100 ), # ... ), ( 'Child' );
And notice that sorta-deeply is a function that does all the work, then passes a simple Bool back to the test. I’d end up writing all of the code that is-deeply does (except for the recursion), and get something back that’s less useful.
Next time we’ll get into making these tests pass. I’m writing the next section right after this, but you won’t get to see it for another week or so, I’m afraid. If you have questions or comments about the first part of this series, please feel free to comment below.