Code Archaeology

Long-time readers will have seen some older posts where I criticised Perl code that I’ve found in various places on the web. I thought it was about time that I admitted to some of the dodgier corners of my programming career.

You may know that one of my hobbies is genealogy. You might also know that there’s a CPAN module for dealing with GEDCOM files and a mailing list for the discussion of the intersection of Perl and genealogy. The list is usually very quiet, but it woke up briefly a few days ago when Ron Savage asked for help reconstructing some old genealogy software of his that had gone missing from his web site. Once he had recovered the missing files, I noticed that in the comments he credited a forgotten program of mine with giving him some ideas. The comment included a link to my web site which (embarrassingly) was now a 404. I don’t like to leave broken links on the web, so I swiftly put a holding page in place on my site and went off to find the missing directory.

It turns out that the directory had been used to distribute a number of my early ventures into open source software. The Wayback Machine had many of them but not everything. And then I remembered that I had full back-ups of some earlier versions of my web site squirrelled away somewhere and it only took an hour or so to track them down. So that I don’t mislay them again, I’ve put them all on Github – in an appropriately named repository.

I think that most of this code dates from around 2000-2003. There’s evidence that a lot of it was stored in CVS or Subversion at some time. But the original repositories are long gone.

So, what do we have there? And just how bad is it?

There’s a really old formmail program. And it immediately becomes apparent that when I wrote it, not only did I not know as much Perl as I thought I did, but I was pretty sketchy on the basics of internet security as well. I can’t remember if I ever put it live, but I really hope not.

Then there’s the “ms” suite of programs. My freelancing company is called Magnum Solutions and it amused me when I realised that people could potentially assume that this code came from Microsoft. I don’t think anyone ever did. Here, you’ll find the beginnings of what later became the nms project – but the nms versions are far more secure.

There’s the original slavorg bot from the #london.pm IRC channel. The channel still has a similar bot, but the code has (thankfully) been improved a lot since this version.

Then there’s something just called spam. I think I was trying to get some stats on how much spam I was getting.

There are a couple of programs that date from my days wrangling Sybase in the City of London. There’s a replacement for Sybase’s own “isql” command line program. My version is called sqpl. I can’t remember what I didn’t like about isql, or how successful my replacement was. What’s interesting about this program is that there are two versions. One uses DBI to connect to the database, but the other uses Sybase’s own proprietary “CTlib” connection library. Proof, I guess, that I was talking to databases back when DBI was too new and shiny to be trusted in production.

The other Sybase-related program is called sybserv. As I recall, Sybase uses a configuration file to define the connection details of the various servers that any given client can connect to. But the format of that file was rather opaque (I seem to remember the IP address being stored as a packed integer in some cases). This program parses that file and presents the data in a far more readable form. I remember using it a lot. I believe it’s the only Perl program I’ve ever written that uses formats.
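
For anyone who has never seen Perl’s format feature, here’s a minimal sketch of that style of report (the server details are invented, and this is nothing like the original code):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Format picture lines refer to package variables.
    our ( $server, $host, $port );

    format STDOUT_TOP =
    Server          Host                  Port
    ------------------------------------------
    .

    format STDOUT =
    @<<<<<<<<<<<<<< @<<<<<<<<<<<<<<<<<<<< @<<<<
    $server,        $host,                $port
    .

    # Invented sample data, standing in for parsed config entries.
    for ( [ SYB_LIVE => 'db1.example.com', 4100 ],
          [ SYB_DEV  => 'db2.example.com', 4100 ] ) {
        ( $server, $host, $port ) = @$_;
        write;
    }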

Then there’s toc. That reads an HTML document, looking for any headers. It then builds a table of contents based on those headers and inserts it into the document. I think it’ll still work.
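
I haven’t looked at it closely since, but the general approach is easy to sketch (this is not the original code, and a regex-based parse like this is easily fooled by unusual markup):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Slurp the document from STDIN or a file named on the command line.
    my $html = do { local $/; <> };

    # Collect the headers (a deliberately crude regex parse).
    my @toc;
    while ( $html =~ m{<h([1-6])[^>]*>(.*?)</h\1>}gis ) {
        push @toc, "<li>$2</li>";
    }

    # Insert a list of them just after the opening <body> tag.
    my $list = join "\n", '<ul>', @toc, '</ul>';
    $html =~ s{(<body[^>]*>)}{$1\n$list}i;

    print $html;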

The final program is webged. This is the one that Ron got inspiration from. It parses a GEDCOM file and turns it into a web site. It works in two modes, you can either pre-generate a whole site (that’s the sane way to use it) or you can use it as a CGI program where it produces each page on the fly as it is requested. I remember that parsing the GEDCOM file was unusably slow, so I implemented an incredibly naive caching mechanism where I stored a Data::Dumper version of the GEDCOM object and just “eval”ed that. I was incredibly proud of myself at the time.
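
The technique is easy to show. Here’s a minimal sketch of that style of caching, with parse_gedcom() standing in for whatever slow parsing code you have:

    use strict;
    use warnings;
    use Data::Dumper;

    sub load_gedcom {
        my ( $file, $cache ) = @_;

        # If the cache exists and is newer than the source file,
        # eval the dumped data structure straight back into memory.
        if ( -e $cache and -M $cache < -M $file ) {
            open my $fh, '<', $cache or die "Can't read $cache: $!";
            my $dump = do { local $/; <$fh> };
            my $VAR1;             # Dumper output assigns to $VAR1
            eval $dump;
            die $@ if $@;
            return $VAR1;
        }

        # Otherwise do the slow parse and write the cache for next time.
        my $data = parse_gedcom($file);

        local $Data::Dumper::Purity = 1;   # cope with self-references
        open my $fh, '>', $cache or die "Can't write $cache: $!";
        print $fh Dumper($data);
        close $fh;

        return $data;
    }

    # Stand-in for the real (slow) GEDCOM parsing.
    sub parse_gedcom {
        my ($file) = @_;
        return { parsed_from => $file };   # pretend result
    }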

The code in most of these programs is terrible. Or, at least, it’s very much a product of its time. I can forgive the lack of “use warnings” (Perl 5.6 wasn’t widely used back when this code was written) as they all have “-w” instead. But it’s the use of ampersands on most of the subroutine calls that makes me cringe the most.

But please have fun looking at the code and pointing out all of the idiocies. Just don’t put any of the CGI programs on a server that is anywhere near the internet.

And feel free to share any of your early code.

Easy PSGI

When I write replies to questions on StackOverflow and places like that recommending that people abandon CGI programs in favour of something that uses PSGI, I often get some push-back from people claiming that PSGI makes things far too complicated.

I don’t believe that’s true. But I think I know why they say it. I think they say it because most of the time when we say “you should really port that code to PSGI” we follow up with links to Dancer, Catalyst or Mojolicious tutorials.

I know why we do that. I know that a web framework is usually going to make writing a web app far simpler. And, yes, I know that in the Plack::Request documentation, Miyagawa explicitly says:

Note that this module is intended to be used by Plack middleware developers and web application framework developers rather than application developers (end users).

Writing your web application directly using Plack::Request is certainly possible but not recommended: it’s like doing so with mod_perl’s Apache::Request: yet too low level.

If you’re writing a web application, not a framework, then you’re encouraged to use one of the web application frameworks that support PSGI (http://plackperl.org/#frameworks), or see modules like HTTP::Engine to provide higher level Request and Response API on top of PSGI.

And, in general, I agree with him wholeheartedly. But I think that when we’re trying to persuade people to switch to PSGI, these suggestions can get in the way. People see switching their grungy old CGI programs to a web framework as a big job. I don’t think it’s as scary as they might think, but I agree it’s often a non-trivial task.

Even without using a web framework, I think that you can get benefits from moving software to PSGI. When I’m running training courses on PSGI, I emphasise three advantages that PSGI gives you over other Perl web development environments.

  1. PSGI applications are easier to debug and test.
  2. PSGI applications can be deployed in any environment you want without changing a line of code.
  3. PSGI applications can take advantage of Plack middleware.

And I think that you can benefit from all of these features pretty easily, without moving to a framework. I’ve been thinking about the best way to do this and I think I’ve come up with a simple plan:

  • Change your shebang line to /usr/bin/plackup (or equivalent)
  • Put all of your code inside my $app = sub { ... }
  • Switch to using Plack::Request to access all of your input parameters
  • Build up your response output in a variable
  • At the end of the code, create and return the required Plack response (either using Plack::Response or just creating the correct array reference).
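
Putting those steps together gives you something like this minimal sketch (the “name” parameter and the HTML are invented for illustration):

    #!/usr/bin/plackup
    use strict;
    use warnings;

    use Plack::Request;

    my $app = sub {
        my ($env) = @_;

        # Wrap the raw PSGI environment in a friendlier object.
        my $req = Plack::Request->new($env);

        # "name" is just an example input parameter.
        my $name = $req->param('name') // 'World';

        # Build the output in a variable rather than printing as you go.
        my $body = "<html><body><h1>Hello, $name</h1></body></html>";

        # Return the standard PSGI response: status, headers, body.
        return [
            200,
            [ 'Content-Type' => 'text/html' ],
            [ $body ],
        ];
    };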

That’s all you need. You can drop your new program into your cgi-bin directory and it will just start working. You can immediately benefit from easier testing and later on, you can easily deploy your application in a different environment or start adding in middleware.

As an experiment to find out how easy this was, I’ve been porting some old CGI programs. Back in 2000, I wrote three articles introducing CGI programming for Linux Format. I’ve gone back to those articles and converted the CGI programs to PSGI (well, so far I’ve done the programs from the first two articles – I’ll finish the last one in the next day or so, I hope).

It’s not the nicest of code. I was still using CGI.pm’s HTML generation functions back then. I’ve replaced those calls with HTML::Tiny. And they aren’t very complicated programs at all (they were aimed at complete beginners). But I hope they’ll be a useful guide to how easy it is to start using PSGI.

My programs are on Github. Please let me know what you think.

If you’re interested in modern Perl Web Development Techniques, you might find it useful to attend my upcoming two-day course on the subject.

Update: On Twitter, Miyagawa reminds me that you can use CGI::Emulate::PSGI or CGI::PSGI to run CGI programs under PSGI without changing them at all (or, at least, changing them a lot less than I’m suggesting here). And that’s what I’d probably do if I had a large amount of CGI code that I wanted to move to PSGI quickly. But I still think it’s worth showing people that simple PSGI programs really aren’t any more complicated than simple CGI programs.

The Joy of Prefetch

If you heard me speak at YAPC or you’ve had any kind of conversation with me over the last few weeks then it’s likely you’ve heard me mention the secret project that I’ve been writing for my wife’s school.

To give you a bit of background, there’s one afternoon a week where the students at the school don’t follow the normal academic timetable. On that afternoon, the teachers all offer classes on wider topics. This year’s topics include Acting, Money Management and Quilt-Making. It’s a wide-ranging selection. Each student chooses one class per term.

This year I offered to write a web app that allowed the students to make their selections. This seemed better than the spreadsheet-based mechanisms that have been used in the past. Each student registers with their school-based email address and then on a given date, they can log in and make their selections.

I wrote the app in Dancer2 (my web framework of choice) and the site started allowing students to make their selections last Thursday morning. In the run-up to the go-live time, Google Analytics showed me that about 180 students were on the site waiting to make their selections. At 7am the selections part of the site went live.

And immediately stopped working. Much to my embarrassment.

It turned out that a disk failed on the server moments after the site went live. It’s the kind of thing that you can’t predict. But it leads to lots of frustrated teenagers and doesn’t give a very good impression.

To give me time to rebuild and stress-test the site we’ve decided to relaunch at 8pm this evening. I’ve spent the weekend rebuilding the app on a new (and more powerful) server.

I’m pretty sure that the timing of the failure was coincidental. I don’t think that my app caused the disk failure. But a failure of this magnitude makes you paranoid, so I spent a lot of yesterday tuning the code.

The area I looked at most closely was the number of database queries that the app was making. There are two main actions that might be slow – the page that builds the list of courses that a student can choose from and the page which saves a student’s selections.

I started with the first of these. I set DBIC_TRACE to 1 and fired up a development copy of the app. I was shocked to see the app run about 120 queries – many of which were identical.

Of course I should have tested this before. And, yes, it’s an idiotic way to build an application. But I’m afraid that using an ORM like DBIx::Class can make it all too easy to write code like this. Fortunately, it makes it easy to fix it too. The secret is “prefetch”.

“Prefetch” is an option you can pass to the “search” method on a resultset. Here’s an example of the difference that it can make.

There are seven year groups in a British secondary school. Most schools call them Year 7 to Year 13 (the earlier years are in primary school). Each year group will have a number of forms. So there’s a one-to-many relationship between years and forms. In database terms, the form table holds a foreign key to the year table. In DBIC terms, the Year result class has a “has_many” relationship with the Form result class and the Form result class has a “belongs_to” relationship with the Year result class.

A naive way to list the years and their associated forms would look something like this (a sketch, assuming a DBIC $schema object and a “name” column on each table):
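
    my @years = $schema->resultset('Year')->all;

    for my $year (@years) {
        print $year->name, "\n";

        # Each call to ->forms here runs another query against
        # the form table.
        for my $form ( $year->forms ) {
            print '  ', $form->name, "\n";
        }
    }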

Run code like that with DBIC_TRACE turned on and you’ll see the proliferation of database queries. There’s one query that selects all of the years and then for each year, you get another query to get all of its associated forms.

Of course, if you were writing raw SQL, you wouldn’t do that. You’d write one query that joins the year and form tables and pulls all of the data back at once. And the “prefetch” option gives you a way to do that in DBIC as well.
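
It looks something like this (the same assumed schema as above):

    my @years = $schema->resultset('Year')->search(
        undef,
        { prefetch => 'forms' },
    )->all;

    for my $year (@years) {
        print $year->name, "\n";

        # No queries here: the forms arrived with the years.
        for my $form ( $year->forms ) {
            print '  ', $form->name, "\n";
        }
    }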

All we have done here is to interpose a call to “search” which adds the “prefetch” option. If you run this code with DBIC_TRACE turned on, then you’ll see that there’s only one database query and it’ll be very similar to the raw SQL that you would have written – it brings back the data from both of the tables at the same time.

But that’s not all of the cleverness of the “prefetch” option. You might be wondering what the difference is between “prefetch” and the rather similar-sounding “join” option. Well, with “join” the columns from the joined table would be added to your main table’s result set. This would, for example, create some kind of mutant Year resultset object that you could ask for Form data using calls like “get_column(‘forms.name’)”. [Update: I was trying to simplify this explanation and I ended up over-simplifying to the point of complete inaccuracy – joined columns only get added to your result set if you use the “columns” or “select/as” attributes. And the argument to “get_column()” needs to be the column name that you have defined using those options.] And that’s useful sometimes, but often I find it easier to use “prefetch” as that uses the data from the form table to build Form result objects which look exactly as they would if you pulled them directly from the database.

So that’s the kind of change that I made in my code. By prefetching a lot of associated tables I was able to drastically cut down the number of queries made to build that course selection page. Originally, it was about 120 queries. I got it down to three. Of course, each of those queries is a lot larger and is doing far more work. But there’s a lot less time spent compiling SQL and pulling data from the database.

The other page I looked at – the one that saves a student’s selections – wasn’t quite so impressive. Originally it was about twenty queries and I got it down to six.

Reducing the number of database queries is a really useful way to make your applications more efficient and DBIC’s “prefetch” option is a great tool for enabling that. I recommend that you take a close look at it.

After crowing about my success on Twitter I got a reply from a colleague pointing me at Test::DBIC::ExpectedQueries which looks like a great tool for monitoring the number of queries in your app.

Driving a Business with Perl

I’ve been a freelance programmer for over twenty years. One really important part of the job is getting paid for the work I do. Back in 1995, when I started out, there wasn’t all of the accounting software available that you get now and (if I recall correctly) the little that was available was all pretty expensive stuff.

At some point I thought to myself “I don’t need to buy one of these expensive systems, I’ll write something myself”. So I sat down and sketched out a database schema and wrote a few Perl programs to insert data about the work I had done and generate invoices from that data.

I don’t remember much about the early versions. I do remember coming to the conclusion that the easiest way to generate PDFs of the invoices was using LaTeX and then wasting a lot of time trying to bend LaTeX to my will. I got something that looked vaguely OK eventually, but it was always incredibly painful if I ever needed to edit it in any way. These days, I use wkhtmltopdf and my life is far easier. I understand HTML and CSS in a way that I will never understand LaTeX.
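
The wkhtmltopdf part really is that simple. Something like this (the filenames are invented; the HTML would come from rendering a template with the invoice data):

    use strict;
    use warnings;

    # Turn the rendered invoice HTML into a PDF.
    system( 'wkhtmltopdf', 'invoice.html', 'invoice.pdf' ) == 0
        or die "wkhtmltopdf failed: $?";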

Why am I telling you this, twenty years after I started using this code? Well, during this last week, I finally decided it was time to put the code on Github. There were two reasons for this. Firstly, I thought that it might be useful for other people. And secondly, I’m ashamed to admit that this is the first time that the code has ever been put under any kind of version control (and, yes, this is an embarrassing case of “do as I say, not as I do”). I have no excuses. The software I used to drive my business was in a few files on a single hard drive. Files that I was hacking away at with gay abandon when I thought they needed changing. I am a terrible role model.

Other than all the obvious reasons, I’m sad that it wasn’t in version control as it would have been interesting to trace the evolution of the software over the last twenty years. For example, the database access started as raw DBI, spent a brief time using Class::DBI and at some point all got moved to DBIx::Class. It’s likely that I wasn’t using the Template Toolkit when I started – but I can’t remember what I was using in its place.

Anyway, the code is there now. I don’t give any guarantees for its quality, but it does the job for me. Let me know if you find any of it interesting or useful (or, even, laughable).

p.s. An interesting side effect of putting it under (public) version control – since I uploaded it to Github I have been constantly tweaking it. The potential embarrassment of having my code available for anyone to see means that I’ve made more improvements to it in the last week than I have in the previous five years. I’m even considering replacing all the command line programs with a Dancer app.

p.p.s. I actually use FreeAgent for all my accounting these days. It’s wonderful and I highly recommend it. But I still use my own system to generate invoices.

Subroutines and Ampersands

I’ve had this discussion several times recently, so I thought it was worth writing a blog post so that I have somewhere to point people the next time it comes up.

Using ampersands on subroutine calls (&my_sub or &my_sub(...)) is never necessary and can have potentially surprising side-effects. It should, therefore, never be used and should particularly be avoided in examples aimed at beginners.

Using an ampersand when calling a subroutine has three effects.

  1. It disambiguates the code so that the Perl compiler knows for sure that it has come across a subroutine call.
  2. It turns off prototype checking.
  3. If you use the &my_sub form (i.e. without parentheses) then the current value of @_ is passed on to the called subroutine.

Let’s look at these three effects in a little more detail.

Disambiguating the code is obviously a good idea. But adding the ampersand is not the only way to do it. Adding a pair of parentheses to the end of the call (my_sub()) has exactly the same effect. And, as a bonus, it looks the same as subroutine calls do in pretty much every other programming language ever invented. I can’t think of a single reason why anyone would pick &my_sub over my_sub().

I hope we’re agreed that prototypes are unnecessary in most Perl code (perhaps that needs to be another blog post at some point). Of course there are a few good reasons to use them, but most of us won’t be using them most of the time. If you’re using them, then turning off prototype checking seems to be a bad idea. And if you’re not using them, then it doesn’t matter whether they’re checked or not. There’s no good argument here for using ampersands.

Then we come to the invisible passing of @_ to the called subroutine. I have no idea why anyone ever thought this was a good idea. The perlsub documentation calls it “an efficiency mechanism” but admits that it is one “that new users may wish to avoid”. If you want @_ to be available to the called subroutine then just pass it in explicitly. Your maintenance programmer (and remember, that could be you in six months’ time) will be grateful and won’t waste hours trying to work out what is going on.
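
Here’s a small demonstration (the subroutine names are, of course, invented):

    use strict;
    use warnings;

    sub inner {
        print "inner got: [@_]\n";
    }

    sub outer {
        &inner();    # explicit empty argument list
        &inner;      # the current @_ is passed on invisibly
    }

    outer( 'some', 'args' );

    # Output:
    # inner got: []
    # inner got: [some args]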

So, no, there is no good reason to use ampersands when calling subroutines. Please don’t use them.

There is, of course, one case where ampersands are still useful when dealing with subroutines – when you are taking a reference to an existing, named subroutine. But that’s the only case that I can think of.
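
That usage looks like this (a trivial illustration):

    sub my_sub { print "called with: @_\n" }

    my $code_ref = \&my_sub;        # the ampersand is needed here
    $code_ref->( 'some', 'args' );  # and the call needs no ampersand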

What do you think? Have I missed something?

It’s unfortunate that a lot of the older documentation on CPAN (and, indeed, some popular beginners’ books) still perpetuate this outdated style. It would be great if we could remove it from all example code.