This article was published yesterday. It shows a way to extract data about a film from IMDB and put it into a local database. Actually, it doesn’t even do that. It produces SQL that you can then run to insert the data.
It’s all rather nasty stuff and indicative of the fact that most people using Perl are still using idioms that would have made us shudder ten years ago.
There were a few things that I didn’t like about the code. The use of curl
to grab data from the web site, the indirect object syntax when creating the XML::Simple object and, in particular, the huge amount of repetitive code used to create the SQL statements.
So I did what anyone would do in my situation. I rewrote it. Here’s my version.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 |
#!/usr/bin/perl use strict; use warnings; use XML::Simple; use LWP::Simple; @ARGV or die "Please provide movie title in quotes\n"; my $movie = shift; $movie =~ s/\s/+/g; my $movieData = get "http://www.imdbapi.com/?r=XML&t=$movie"; my $data = XMLin( $movieData ); my @fields = qw[released rating director genre writer runtime plot id title votes poster year rated actors]; my %film = %{$data->{movie}}; foreach (@fields) { $film{$_} =~ s/'/\\'/g; } my $tstamp = time(); my $sql = 'INSERT INTO movie_collection ('; $sql .= join ', ', @fields; $sql .= ') VALUES ('; $sql .= join ', ', map { qq['$film{$_}'] } @fields; $sql .= ",'" . time . "');\n"; print $sql; |
I haven’t actually changed that much. I’ve tidied up a bit. Switched to using LWP::Simple, removed some unnecessary escaping, things like that. I have made two pretty big changes. I’ve got rid of all of the nasty variables containing data about a film. A film is a single object and therefore should be stored in a single variable. And, happily enough, the $data you get back from XMLin
contains a hash that does the trick perfectly.
The second change I made was to rejig the way that the SQL is created. By using an array that contains the names of all of the columns in the table, I can generate the SQL programmatically without all of that repetitive code. I’ve even made the SQL a little safer by explicitly listing the columns that we are inserting data into (this has the side effect of no longer needing to insert a NULL into the id column).
Of course, this would just be a first step. The whole idea of generating SQL to run against the database is ugly. You’d really want to use DBIx::Class (or, at the very least, DBI) to insert the data directly into the database. And why mess around with raw XML when you can us something like IMDB::Film to do it?
At that point in my thought process I had an epiphany. You don’t need the database at all. The IMDB data changes all the time. Why take a local copy? Why not just use the web service directly with IMDB::Film (or perhaps WebService::IMDB – I haven’t used either of them so I have no strong opinions on this).
In general, I think that the original code was too complicated. Which made it hard to maintain. My version is better (but I am, of course, biased) but it can be made even better by using more from CPAN.
CPAN is Perl’s killer app. If you’re not using CPAN then you’re not using half the power of Perl.
What do you think? How would you write this program?
Update: A few people have mentioned the fact that I’m directly interpolating random data into my SQL statements – which is generally seen as a bad thing as it opens the door to SQL injection attacks. In my defence, I’d like to make a couple of points.
Firstly, the data I’m using isn’t just any old data. It’s data that is returned from the IMDB API. So it would be hard to use this for a malicious attack on the system (at least until Hollywood makes a film about the life of Bobby Tables).
Secondly, I am cleaning the data before using it. I’m escaping any single quotes in the input data. I think that removes the possibility of attack. I could be wrong though, if that’s the case, please let me know what I’m missing.
But, in general, I agree that this approach is dangerous. This is one of the major advantages of using DBI. By using bound parameters in your SQL statements you can remove possibility of SQL injection attacks.
Update 2: You can, of course, rely on Zefram to point out the issues here. His comment is well worth reading.
Other people (on IRC) raised the potential of other Unicode characters that databases treat as quote characters but that aren’t covered by my substitution.
I really don’t think showing the direct insertion of strings from an outside source in SQL is a good idea. You can do the same thing with placeholders fairly easily, and the code will be much safer.
Oh, you’re absolutely right. And I considered rewriting it to use DBI for exactly that reason. But then I thought that as the data is coming back from the IMDB API, the danger is minimal.
At the very least, I can say that I haven’t added any new insecurities π
http://xkcd.com/327/
no more comments needed.
Bobby Tables! I don’t even have to click the link
Escaping single quotes *might* be enough to avoid SQL injection attacks, or it might not be. Are you comfortable relying on it? I would prefer something more paranoid like
OK that might be a bit extreme, but you can be confident it is safe.
I can be confident that your version will break a lot of the data I get back from the API π See an example. See how much non-alphanumeric data it includes?
I’m becoming more confident in my solution. The argument goes like this. You can only do nasty things to me if you can close my SQL statement and open a new one. All of my interpolated data is within single-quoted strings. Therefore, to start a new statement you need to close my string. But by escaping any single quotes in the data I’m preventing you from doing that.
What am I missing?
Zefram gave an example of a case that your code misses. The point is that by ‘enumerating badness’, you make sure that if you’ve overlooked something it will be a vulnerability. Better to fail safe by rejecting anything not in a list of characters you know to be okay. Both solutions (yours and mine) need extending to handle additional characters: yours does not handle backslash correctly, and mine doesn’t do accented characters. But the consequences of that omission are different. You need to write the test so that even if you have missed something out (which will inevitably happen), it won’t leave you with SQL injection bugs. Losing some accented characters is a much more benign failure mode IMHO.
Your single-quote escaping is *not* sufficient to avoid SQL injection. Most obviously, backslashes also need to be escaped. Specific exploit: I give you the string
q{\');drop database imdb;-- }
. You escape the single quote yieldingq{\\');drop database imdb;-- }
, and then stick it into quotes yieldingq{insert into movie (...) values ('\\');drop database imdb;-- ', ...);}
. Theq{'\\'}
parses as a complete string literal, representing a string containing a single backslash. You lose.The right way to embed data in SQL is to have a function whose input is a data object (in whatever form you’re dealing with in Perl space) and whose output is a string of SQL source constituting an expression that will evaluate to the in-database representation of the same data. If you’re dealing with a string (or anything that you represent as a string in SQL space) then you’d expect that output expression to be in the form of an SQL string literal. The concept generalises to all SQL data types, including ones that don’t have a literal syntax, and to all types of object that you deal with in Perl space.
Here’s the specific logic that I developed at $ork to represent a Perl octet string as an SQL octet string:
This is specifically for MySQL; you need to experiment to write the correct equivalent for your particular database. Also, obviously, if you’re dealing with general Unicode then you’ll need an encoding layer on top of this.
You use it like “
where title=@{[sql_octetstring($title)]}
” or “values (@{[join(q(, ), map { sql_octetstring($film{$_}) } @fields)]})
“. The babycart operator is brilliant for this sort of thing, because it gives you all the right cues about the nesting that is logically occurring.The SQL standard way of escaping single quotes is not to backslash them, but to double them up. That is:
s{‘}{”}g
Of course, many SQL databases support backslash escaping instead of or as well as the SQL standard method.
This is a good example for newbies how things should happen in modern perl. I am going to show it to my students. But can you make a copy of original script, because it can dissappear.
Good idea. See here.
I’d love to see an example of how this looks on a site/page php/html
anyone have a link?