cpangrep Unicode bytes are being doubly encoded

Searching here:

http://grep.cpan.me/?q=-%3Eto_json

I noticed that the Unicode bytes in this file:

https://metacpan.org/source/SJDY/Mojo-Webqq-1.6.0/doc/Webqq.pod#L357

are being doubly encoded.

use Mojo::JSON qw(encode_json);
my $json_hash = $msg->to_json_hash();    #èŽ·å–åˆ°ç»è¿‡utf8 decodeçš„hashå¼•ç”¨

This is due to printing bytes not marked as "utf8" to a filehandle set to :encoding(utf8):

no utf8;
my $string = 'my $json_hash = $msg->to_json_hash();    #获取到经过utf8 decode的hash引用';
binmode STDOUT, ":encoding(UTF-8)";
print $string;
# Gives the same output as above.

Thanks again for the grep.cpan.me service. It's very useful.

Nov 07 '15 10:11 benkasminbullock

Sadly it's not quite that simple.

The approach we currently take is that everything is bytes. While it filters out most binary files, as they aren't useful, there are cases where people actually want to search over the bytes of the file rather than the resulting Unicode.

Additionally, because we don't know anything other than that the files are bytes, we can't assume they are UTF-8, so it's a deliberate decision in this case to show you the bytes. We probably could make this better (use some encoding guessing by default and add an option to show the raw bytes); however, that's rather tricky to get right, so currently the approach of only doing bytes is much easier.
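
For illustration only, here is a minimal sketch of the "encoding guessing" idea, assuming the simplest possible guess: try a strict UTF-8 decode and fall back to the raw bytes if it fails. The helper name decode_if_utf8 is hypothetical and is not part of cpangrep; real guessing would need to handle other encodings, which is where it gets tricky.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper (not cpangrep code): return decoded characters if
# $bytes is strictly valid UTF-8, otherwise return the bytes unchanged.
sub decode_if_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC stops it from modifying $bytes in place.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return defined $text ? $text : $bytes;
}

my $valid   = "\xe8\x8e\xb7";   # UTF-8 bytes for U+83B7
my $invalid = "\xff\xfe";       # not valid UTF-8

printf "%d\n", length decode_if_utf8($valid);     # one character
printf "%d\n", length decode_if_utf8($invalid);   # two bytes, left as-is
```

A raw-bytes option would then simply skip the call to the helper.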

Nov 07 '15 10:11 dgl

Here is a download of your page:

wget 'http://grep.cpan.me/?q=-%3Eto_json'

Here is the output of viewing the bytes:

00002fb0  20 24 6a 73 6f 6e 5f 68  61 73 68 20 3d 20 24 6d  | $json_hash = $m|
00002fc0  73 67 3c 73 74 72 6f 6e  67 3e 2d 26 67 74 3b 74  |sg<strong>-&gt;t|
00002fd0  6f 5f 6a 73 6f 6e 3c 2f  73 74 72 6f 6e 67 3e 5f  |o_json</strong>_|
00002fe0  68 61 73 68 28 29 3b 20  20 20 20 23 26 65 67 72  |hash();    #&egr|
00002ff0  61 76 65 3b 26 23 31 34  32 3b 26 6d 69 64 64 6f  |ave;&#142;&middo|
00003000  74 3b 26 61 72 69 6e 67  3b 26 23 31 34 33 3b 26  |t;&aring;&#143;&|
00003010  23 31 35 30 3b 26 61 72  69 6e 67 3b 26 23 31 33  |#150;&aring;&#13|
00003020  36 3b 26 64 65 67 3b 26  63 63 65 64 69 6c 3b 26  |6;&deg;&ccedil;&|
00003030  72 61 71 75 6f 3b 26 23  31 34 33 3b 26 65 67 72  |raquo;&#143;&egr|
00003040  61 76 65 3b 26 69 71 75  65 73 74 3b 26 23 31 33  |ave;&iquest;&#13|
00003050  35 3b 75 74 66 38 20 64  65 63 6f 64 65 26 63 63  |5;utf8 decode&cc|
00003060  65 64 69 6c 3b 26 23 31  35 34 3b 26 23 31 33 32  |edil;&#154;&#132|
00003070  3b 68 61 73 68 26 61 72  69 6e 67 3b 26 66 72 61  |;hash&aring;&fra|
00003080  63 31 34 3b 26 23 31 34  39 3b 26 63 63 65 64 69  |c14;&#149;&ccedi|
00003090  6c 3b 26 23 31 34 38 3b  26 75 6d 6c 3b 0a 20 20  |l;&#148;&uml;.  |
000030a0  20 20 6d 79 20 24 6a 73  6f 6e 5f 74 65 78 74 20  |  my $json_text |

You have turned the bytes into HTML entities, which can only represent Unicode values, not raw bytes. Here is the hexdump of the raw bytes from the metacpan page:

wget https://metacpan.org/source/SJDY/Mojo-Webqq-1.6.0/doc/Webqq.pod

00005da0  20 20 20 6d 79 20 24 6a  73 6f 6e 5f 68 61 73 68  |   my $json_hash|
00005db0  20 3d 20 24 6d 73 67 2d  26 67 74 3b 74 6f 5f 6a  | = $msg-&gt;to_j|
00005dc0  73 6f 6e 5f 68 61 73 68  28 29 3b 20 20 20 20 23  |son_hash();    #|
00005dd0  e8 8e b7 e5 8f 96 e5 88  b0 e7 bb 8f e8 bf 87 75  |...............u|

If you look at these bytes you can see "e8 8e b7" at the start of the last line (offset 00005dd0), which is the UTF-8 encoding of this Chinese character:

获

Please confirm at this website:

http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=%E8%8E%B7

What you are sending at the same point in the text is this:

00002fc0  73 67 3c 73 74 72 6f 6e  67 3e 2d 26 67 74 3b 74  |sg<strong>-&gt;t|
00002fd0  6f 5f 6a 73 6f 6e 3c 2f  73 74 72 6f 6e 67 3e 5f  |o_json</strong>_|
00002fe0  68 61 73 68 28 29 3b 20  20 20 20 23 26 65 67 72  |hash();    #&egr|
00002ff0  61 76 65 3b 26 23 31 34  32 3b 26 6d 69 64 64 6f  |ave;&#142;&middo|
00003000  74 3b 26 61 72 69 6e 67  3b 26 23 31 34 33 3b 26  |t;&aring;&#143;&|

The &egrave; entity corresponds to Unicode U+00E8, then the &#142; entity corresponds to U+008E, then the &middot; entity is U+00B7 MIDDLE DOT. So you have encoded each raw byte as a Unicode entity, and the effect is the same as printing non-Unicode bytes via :encoding(utf8). It was my error to claim you had used :encoding(utf8) wrongly; in fact you are promoting non-Unicode bytes into entities.
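
To make the difference concrete, here is a small sketch, assuming a hypothetical helper to_entities that writes every character above 0x7F as a numeric entity (I don't know how cpangrep actually produces its entities, and it uses named entities where they exist). Applied to the raw bytes it gives one entity per byte; applied after decoding it gives one entity for the whole character.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: replace each character above 0x7F with a
# numeric HTML entity holding its code point.
sub to_entities {
    my ($str) = @_;
    $str =~ s/([^\x00-\x7f])/sprintf('&#%d;', ord $1)/ge;
    return $str;
}

my $bytes = "\xe8\x8e\xb7";   # UTF-8 bytes for U+83B7

# Each byte is treated as its own character: three entities.
my $from_bytes = to_entities($bytes);

# Decoding first yields a single character: one entity.
my $from_text = to_entities(decode('UTF-8', $bytes));

print "$from_bytes\n";   # &#232;&#142;&#183;
print "$from_text\n";    # &#33719;
```

The first line of output is equivalent to the &egrave;&#142;&middot; sequence in the hexdump above.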

Nov 07 '15 12:11 benkasminbullock

This sounds like a dupe of https://github.com/dgl/cpangrep/issues/36.

Nov 07 '15 19:11 karenetheridge

I'm closing this due to a lack of action.

Oct 24 '18 00:10 benkasminbullock

after a mere 3 years?

Oct 24 '18 00:10 karenetheridge

I'm reopening this one as well and unsubscribing, email [email protected] or [email protected] if you need me to act here again.

Oct 24 '18 00:10 benkasminbullock
