cpangrep
Unicode bytes are being doubly encoded
Searching here:
http://grep.cpan.me/?q=-%3Eto_json
I noticed that the Unicode bytes in this file:
https://metacpan.org/source/SJDY/Mojo-Webqq-1.6.0/doc/Webqq.pod#L357
are being doubly encoded.
use Mojo::JSON qw(encode_json);
my $json_hash = $msg->to_json_hash(); #获å–到ç»è¿‡utf8 decodeçš„hash引用
This is due to printing bytes not marked as "utf8" in a filehandle set to :encoding(utf8):
no utf8;
my $string = 'my $json_hash = $msg->to_json_hash(); #获取到经过utf8 decode的hash引用';
binmode STDOUT, ":encoding(UTF-8)";
print $string;
# Gives the same output as above.
Thanks again for the grep.cpan.me service. It's very useful.
Sadly it's not quite that simple.
The approach we currently take is that everything is bytes. Although it filters out most binary files because they aren't useful, there are cases where people actually want to search over the bytes of a file rather than the resulting Unicode.
Additionally, because we don't know anything other than that the files are bytes, we can't assume they are UTF-8, so showing you the bytes is a deliberate decision in this case. We probably could make this better (use some encoding guessing by default and add an option to show the raw bytes); however, that's rather tricky to get right, so for now the bytes-only approach is much easier.
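As a rough illustration of the "guess by default, fall back to raw bytes" idea mentioned above, here is a minimal sketch. It is not cpangrep's code: decode_or_raw is a hypothetical helper name, and it assumes that only strict UTF-8 is attempted, using the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper (not part of cpangrep): try a strict UTF-8 decode
# and fall back to the untouched bytes when the input is not valid UTF-8.
# Returns the string plus a flag saying whether it was decoded.
sub decode_or_raw {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC stops decode() from modifying $bytes in place.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return defined $text ? ($text, 1) : ($bytes, 0);
}

my ($line, $is_text) = decode_or_raw("#\xe8\x8e\xb7");
printf "decoded=%d length=%d\n", $is_text, length $line;
```

Valid UTF-8 alone does not prove that a file really is UTF-8 text, which is part of why guessing is tricky to get right.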
Here is a download of your file:
wget http://grep.cpan.me/?q=-%3Eto_json
Here is the output of viewing the bytes:
00002fb0 20 24 6a 73 6f 6e 5f 68 61 73 68 20 3d 20 24 6d | $json_hash = $m|
00002fc0 73 67 3c 73 74 72 6f 6e 67 3e 2d 26 67 74 3b 74 |sg<strong>-&gt;t|
00002fd0 6f 5f 6a 73 6f 6e 3c 2f 73 74 72 6f 6e 67 3e 5f |o_json</strong>_|
00002fe0 68 61 73 68 28 29 3b 20 20 20 20 23 26 65 67 72 |hash();    #&egr|
00002ff0 61 76 65 3b 26 23 31 34 32 3b 26 6d 69 64 64 6f |ave;&#142;&middo|
00003000 74 3b 26 61 72 69 6e 67 3b 26 23 31 34 33 3b 26 |t;&aring;&#143;&|
00003010 23 31 35 30 3b 26 61 72 69 6e 67 3b 26 23 31 33 |#150;&aring;&#13|
00003020 36 3b 26 64 65 67 3b 26 63 63 65 64 69 6c 3b 26 |6;&deg;&ccedil;&|
00003030 72 61 71 75 6f 3b 26 23 31 34 33 3b 26 65 67 72 |raquo;&#143;&egr|
00003040 61 76 65 3b 26 69 71 75 65 73 74 3b 26 23 31 33 |ave;&iquest;&#13|
00003050 35 3b 75 74 66 38 20 64 65 63 6f 64 65 26 63 63 |5;utf8 decode&cc|
00003060 65 64 69 6c 3b 26 23 31 35 34 3b 26 23 31 33 32 |edil;&#154;&#132|
00003070 3b 68 61 73 68 26 61 72 69 6e 67 3b 26 66 72 61 |;hash&aring;&fra|
00003080 63 31 34 3b 26 23 31 34 39 3b 26 63 63 65 64 69 |c14;&#149;&ccedi|
00003090 6c 3b 26 23 31 34 38 3b 26 75 6d 6c 3b 0a 20 20 |l;&#148;&uml;.  |
000030a0 20 20 6d 79 20 24 6a 73 6f 6e 5f 74 65 78 74 20 |  my $json_text |
You have turned the bytes into HTML entities, which can only represent Unicode values, not raw bytes. Here is the hexdump of the raw bytes from the metacpan page:
wget https://metacpan.org/source/SJDY/Mojo-Webqq-1.6.0/doc/Webqq.pod
00005da0 20 20 20 6d 79 20 24 6a 73 6f 6e 5f 68 61 73 68 |   my $json_hash|
00005db0 20 3d 20 24 6d 73 67 2d 26 67 74 3b 74 6f 5f 6a | = $msg-&gt;to_j|
00005dc0 73 6f 6e 5f 68 61 73 68 28 29 3b 20 20 20 20 23 |son_hash();    #|
00005dd0 e8 8e b7 e5 8f 96 e5 88 b0 e7 bb 8f e8 bf 87 75 |...............u|
If you look at these bytes you can see "e8 8e b7" at the start of the last line, which is the UTF-8 encoding of this Chinese character:
获
Please confirm at this website:
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=%E8%8E%B7
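The same check can be made locally. A minimal sketch, assuming Perl's core Encode module, that decodes the three bytes and prints the resulting code point:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The three bytes seen at offset 00005dd0 in the metacpan hexdump.
my $bytes = "\xe8\x8e\xb7";

# Decoding them as UTF-8 yields a single character.
my $char = decode('UTF-8', $bytes);

printf "characters=%d code point=U+%04X\n", length $char, ord $char;
# prints: characters=1 code point=U+83B7
```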
What you are sending at the same point in the text is this:
00002fc0 73 67 3c 73 74 72 6f 6e 67 3e 2d 26 67 74 3b 74 |sg<strong>-&gt;t|
00002fd0 6f 5f 6a 73 6f 6e 3c 2f 73 74 72 6f 6e 67 3e 5f |o_json</strong>_|
00002fe0 68 61 73 68 28 29 3b 20 20 20 20 23 26 65 67 72 |hash();    #&egr|
00002ff0 61 76 65 3b 26 23 31 34 32 3b 26 6d 69 64 64 6f |ave;&#142;&middo|
00003000 74 3b 26 61 72 69 6e 67 3b 26 23 31 34 33 3b 26 |t;&aring;&#143;&|
The &egrave; entity corresponds to U+00E8, then the &#142; entity corresponds to U+008E, then the &middot; entity is U+00B7 MIDDLE DOT. So you have encoded all the raw bytes as Unicode entities, and the effect is the same as printing non-Unicode bytes via :encoding(utf8). It was my error to claim you had used :encoding(utf8) wrongly; in fact you are promoting non-Unicode bytes into entities.
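To make the difference concrete, here is a small sketch of the two behaviours. The to_entities helper is hypothetical and only illustrates the effect; it is not taken from the cpangrep source, and it uses numeric entities throughout rather than named ones such as &egrave;.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: replace every non-ASCII element of a string
# with a numeric HTML entity.
sub to_entities {
    my ($s) = @_;
    $s =~ s/([^\x00-\x7f])/sprintf '&#%d;', ord $1/ge;
    return $s;
}

my $bytes = "\xe8\x8e\xb7";    # UTF-8 bytes of U+83B7

# Applied to undecoded bytes: one entity per byte (the doubled output).
print to_entities($bytes), "\n";                     # &#232;&#142;&#183;

# Applied after decoding: one entity for the one character.
print to_entities(decode('UTF-8', $bytes)), "\n";    # &#33719;
```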
This sounds like a dupe of https://github.com/dgl/cpangrep/issues/36.
I'm closing this due to a lack of action.
after a mere 3 years?
I'm reopening this one as well and unsubscribing, email [email protected] or [email protected] if you need me to act here again.