Over the last couple of days I’ve been involved in a couple of discussions where it is clear that other people don’t understand how Perl deals with Unicode. The documentation is clear and detailed (there’s even a good tutorial) but for some reason people still persist in misunderstanding it.
Here’s a quick quiz. Can you explain (in detail) what is going on with all of these four command-line programs? And for bonus points, which one should we be emulating in our code?
$ perl -E'say "£"'
£
$ perl -Mutf8 -E'say "£"'
�
$ perl -C -E'say "£"'
Â£
$ perl -C -Mutf8 -E'say "£"'
£
In all cases, assume that my locale is set to en_US.UTF-8.
I’ll post explanations in a few days’ time.
Update: Coincidentally, Miyagawa posted something very similar on his blog.
In the first case, Perl treats the pound sign as two bytes, C2 and A3. It simply outputs them unchanged, and the terminal interprets them as a utf-8 pound sign.
In the second case, Perl knows it is a pound sign. Its output encoding defaults to latin-1, though, so it outputs the pound sign in latin-1, i.e. the single byte A3. A lone A3 byte is not valid utf-8, so the terminal displays the replacement character.
In the third case, Perl again sees the two bytes, as in case 1, and treats them as two latin-1 characters. It knows the output is utf-8, so it encodes both characters to utf-8. C2 corresponds to the capital A with circumflex and A3 to the pound sign, so the terminal shows both (the bytes written are C3 82 C2 A3).
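The first three cases can be reproduced at the byte level with the core Encode module. This is only a sketch: it imitates what the `-Mutf8` and `-C` switches do by calling `decode` and `encode` explicitly, and `hex_bytes` is a helper name made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The two bytes a UTF-8 terminal sends for a pound sign.
my $source_bytes = "\xC2\xA3";

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', $_) } unpack('C*', $bytes);
}

# Without "use utf8" the literal stays as two characters, U+00C2 and U+00A3.
my $without_utf8 = $source_bytes;

# With "use utf8" the literal is decoded into one character, U+00A3.
my $with_utf8 = decode('UTF-8', $source_bytes);

# Case 1: no utf8, no -C. Each character goes out as one latin-1 byte.
my $case1 = encode('ISO-8859-1', $without_utf8);

# Case 2: utf8, no -C. The single character goes out as one latin-1 byte.
my $case2 = encode('ISO-8859-1', $with_utf8);

# Case 3: no utf8, with -C. Both characters are encoded to utf-8.
my $case3 = encode('UTF-8', $without_utf8);

print "case 1: ", hex_bytes($case1), "\n";    # C2 A3
print "case 2: ", hex_bytes($case2), "\n";    # A3
print "case 3: ", hex_bytes($case3), "\n";    # C3 82 C2 A3
```

Case 3 is the classic double-encoding mistake: bytes that were already utf-8 get encoded a second time.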
The last case is the “correct” one: Perl decodes the utf-8 source into a single pound-sign character and encodes it back to utf-8 on output.
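In a script rather than a one-liner, the same behaviour is usually spelled out in the code instead of relying on command-line switches. A minimal sketch, assuming the source file itself is saved as utf-8; the in-memory filehandle is there only to inspect the bytes that the encoding layer produces.

```perl
use strict;
use warnings;
use utf8;    # the source is utf-8, so "£" becomes a single character

my $pound = "£";

# Roughly what -C does for STDOUT: encode characters as utf-8 on output.
binmode(STDOUT, ':encoding(UTF-8)');
print "$pound\n";

# Write the same string through an identical layer into a scalar
# so the resulting bytes can be examined.
open(my $fh, '>:encoding(UTF-8)', \my $buffer)
    or die "cannot open in-memory handle: $!";
print {$fh} $pound;
close($fh);

my $hex = join ' ', map { sprintf('%02X', $_) } unpack('C*', $buffer);
printf "%d character, bytes: %s\n", length($pound), $hex;    # 1 character, bytes: C2 A3
```

Decoding on the way in and encoding on the way out keeps everything in between working on characters rather than bytes.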
saved my day, thanks