Unicode and Perl

Over the last couple of days I’ve been involved in several discussions where it was clear that other people don’t understand how Perl deals with Unicode. The documentation is clear and detailed (there’s even a good tutorial) but for some reason people persist in misunderstanding it.

Here’s a quick quiz. Can you explain (in detail) what is going on in each of these four command-line programs? And for bonus points, which one should we be emulating in our code?

$ perl -E'say "£"'
£
$ perl -Mutf8 -E'say "£"'
�
$ perl -C -E'say "£"'
Â£
$ perl -C -Mutf8 -E'say "£"'
£

In all cases, assume that my locale is set to en_US.UTF-8.

I’ll post explanations in a few days’ time.

Update: Coincidentally, Miyagawa posted something very similar on his blog.

2 thoughts on “Unicode and Perl”

E. Choroba says:

25 August, 2013 at 08:10

In the first case, Perl treats the pound as two bytes, C2 and A3. It simply outputs them, and the terminal interprets them as a utf-8 pound sign.

In the second case, Perl knows it is a pound sign. Its output encoding defaults to latin-1, though, so it outputs the pound in latin-1, i.e. A3. A lone A3 byte is invalid utf-8, so the terminal displays the replacement character.

In the third case, Perl again sees the two bytes, as in case 1. It knows the output is utf-8, so it encodes the two latin-1 characters to utf-8. C2 corresponds to the capital A with circumflex, A3 to the pound sign.

The last case is the “correct” one – Perl sees the pound sign encoded in utf-8, and outputs it again in utf-8.

perl base64 says:

2 September, 2013 at 15:13

saved my day, thanks

Perl Hacks

Just another Perl Hacker's blog
