
[doc] perlpacktut

Open zhijieshi opened this issue 4 years ago • 7 comments

Where

https://perldoc.perl.org/perlpacktut

Description

Issue 1:

The example at the end of "The Basic Principle" packs "byte contents from a string of hexadecimal digits". The code is pack( 'H2' x 10, 30..39 ). It is not straightforward to read 30 as "hexadecimal digits". Why make it unnecessarily confusing?

The following would be easier for beginners, avoiding "misunderstanding", which is the purpose of this tutorial.

my $s = pack( 'H2' x 10, '30'..'39');
print "$s\n";

Issue 2:

Since there are Unicode strings and byte strings, it is not clear which of them can be unpacked. It seems that unpacking Unicode strings may give unexpected results.

#!/usr/bin/perl -w
use v5.34;
use utf8;
use strict;
use warnings;
use Encode qw(encode decode);

my $s = "0123456789😀";
my $b = encode "UTF8", $s;

say "Unpack unicode string 1: ",  unpack( '(H2)*', $s);
say "Unpack unicode string 2: ",  unpack( 'H*', $s);
say "Unpack bytes:            ", unpack( 'H*', $b);

{
use bytes;
say "Unpack unicode string 3: ",  unpack( 'H*', $s);
}

The output is:

Character in 'H' format wrapped in unpack at .\t.pl line 11.
Unpack unicode string 1: 3031323334353637383900
Character in 'H' format wrapped in unpack at .\t.pl line 12.
Unpack unicode string 2: 3031323334353637383900
Unpack bytes:            30313233343536373839f09f9880
Unpack unicode string 3: 30313233343536373839f09f9880

zhijieshi, Oct 19 '21 16:10

Thank you. I agree with your first point, though it may be made even clearer by using strings containing hex digits A-F in the example.
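A minimal sketch of what such an example might look like, assuming an explicit list of two-digit hex strings is acceptable in the tutorial (the variable names are illustrative, not taken from perlpacktut):

```perl
use strict;
use warnings;

# Each element is a two-character string of hex digits, including A-F.
# 'H2' packs one such string into a single byte.
my @hex = qw(4A 4B 4C 4D 4E 4F);
my $s   = pack('H2' x scalar(@hex), @hex);

print "$s\n";    # bytes 0x4A..0x4F, i.e. JKLMNO
```

Listing the strings with qw() avoids the range operator entirely, so the example shows only what 'H2' does.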

For point 2, the Unicode section probably needs to be rewritten as it's overly abstraction dependent, similar to your "use bytes" example which breaks the Perl string abstraction. I'm not sure exactly what you're suggesting is the problem there otherwise.

Grinnz, Oct 19 '21 17:10

For point 2, I would like to see some clarifications in the tutorial. I agree that some sections may need to be rewritten. When I read the tutorial, I had these questions.

Q1: Can a Unicode string be unpacked? If it is not recommended, the tutorial can say so plainly: "do not unpack Unicode strings".

Q2: The example in the tutorial seems to suggest that it is fine to unpack a Unicode string into "strings". If a Unicode string can be unpacked in some cases, when does it work?

while (<>) {
    my ($date, $desc, $income, $expend) =
        unpack("A10xA27xA7xA*", $_);
    $tot_income += $income;
    $tot_expend += $expend;
}

zhijieshi, Oct 19 '21 20:10

It's a bit complex. The Perl string abstraction is simply a sequence of codepoints - not Unicode, nor bytes, until something interprets it as such. The 'a' and 'A' patterns for example will pass through a codepoint whether or not it fits in a byte, but other patterns like 'C' which are defined to operate on bytes have less obvious behavior (and unfortunately don't warn that you're doing something strange).

And your example has an additional complication. Unless you pass -CSD or add a decoding layer to STDIN or the files you are reading from, <> will return encoded bytes, not Unicode strings. So in that example unpack is likely receiving a byte string.
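The difference a decoding layer makes can be sketched as follows. This is only an illustration: it reads from an in-memory filehandle instead of STDIN, and the sample line and variable names are made up for the purpose.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical input: one line of UTF-8 encoded bytes, as a file on disk would hold it.
my $bytes = encode('UTF-8', "caf\x{e9}\n");

# No layer: the line comes back as the raw encoded bytes.
open my $raw_fh, '<', \$bytes or die "open failed: $!";
my $raw_line = <$raw_fh>;
close $raw_fh;

# With a decoding layer: the line comes back as decoded characters.
open my $dec_fh, '<:encoding(UTF-8)', \$bytes or die "open failed: $!";
my $dec_line = <$dec_fh>;
close $dec_fh;

printf "raw:     %d elements, hex %s\n", length($raw_line), unpack('H*', $raw_line);
printf "decoded: %d elements\n", length($dec_line);
```

Without the layer the string has six elements (the two-byte encoding of U+00E9 counts twice); with the layer it has five.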

Grinnz, Oct 19 '21 21:10

Thanks for the explanation. To summarize, a string may contain a codepoint that consists of more than one byte. The 'a' or 'A' pattern works with those codepoints, while some other patterns work with bytes only.

zhijieshi, Oct 20 '21 00:10

It's more accurate to say it may have a codepoint which cannot represent a byte because it is higher than 255. What it's represented by internally is immaterial (unless using "use bytes", which is why that is problematic).
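One way to act on this distinction, sketched here with the string from the original report (the regex check and the variable names are illustrative assumptions, not something the tutorial prescribes), is to test for codepoints above 255 and encode explicitly before using a byte-oriented format:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $s = "0123456789\x{1F600}";    # ten ASCII digits plus U+1F600

# True when any codepoint is above 255 and so cannot stand for a byte.
my $has_wide = $s =~ /[^\x00-\xFF]/ ? 1 : 0;

# After encoding, every element of the string is a byte value.
my $bytes = encode('UTF-8', $s);

printf "characters: %d, has wide codepoint: %d\n", length($s), $has_wide;
printf "bytes:      %d, hex: %s\n", length($bytes), unpack('H*', $bytes);
```

The hex output matches the "Unpack bytes" line in the original report, and no "use bytes" is needed to obtain it.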

Grinnz, Oct 20 '21 00:10

I've never fully understood pack and unpack, and now I don't think it's just me.

Looking at @zhijieshi's first example, I would think that if it were changed to

my $s = pack( 'H2' x 26, '41'..'5A' );

things would be clear. But instead this comes out

ABCDEFGHIPQRSTUVWXY`abcdef

And if we make the first value in the range into a number containing a hex-only digit, we get

my $s = pack( 'H2' x 6, '4A'..'4F' );
Argument "4A" isn't numeric in range (or flop)

So, the numbers 30..39 are interpreted as hex, but not all hex numbers can be used here.

And this is near the beginning of a tutorial, talking about beginner-level stuff.
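The output above appears to come from the range operator rather than from pack: between two strings, '..' steps by string increment, which counts in decimal digits and never produces 4A through 4F. A hedged sketch of that behaviour, with one possible way to generate the hex strings from numbers instead (variable names are illustrative):

```perl
use strict;
use warnings;

# '5A' is not numeric, so '41' .. '5A' is a string range that steps
# 41, 42, ... 49, 50, ... and skips every value containing A-F.
my @range = ('41' .. '5A');
print "@range[0 .. 25]\n";

# Building the strings from a numeric range covers the A-F digits too.
my @hex = map { sprintf '%02X', $_ } 0x41 .. 0x5A;
my $s   = pack('H2' x scalar(@hex), @hex);
print "$s\n";    # ABCDEFGHIJKLMNOPQRSTUVWXYZ
```

In both cases pack only sees the list it is handed; the surprise is in how that list gets built.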

khwilliamson, May 04 '25 17:05

The behavior of PP pack/unpack regarding bit vectors, that is, the logic for cases where there are fewer than 8 bits in a PP TUI or PP wire binary "byte", is very poorly documented. I spent 2 hours figuring out how it works. The POD examples also often create mixed-endian base 2 TUI PP strings. The endianness inside one byte runs in the opposite direction from how pack/unpack take in and put out the bytes of a string, which is almost always left to right, low index to high index.
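A small sketch of the ordering in question, using the 'b'/'B' and 'h'/'H' formats on ASCII input (the variable names are illustrative, and this covers only whole bytes, not the partial-byte cases mentioned above):

```perl
use strict;
use warnings;

my $byte = 'A';    # one byte, value 0x41 = 0b01000001

my $bits_desc = unpack('B8', $byte);    # most significant bit first:  01000001
my $bits_asc  = unpack('b8', $byte);    # least significant bit first: 10000010
my $hex_high  = unpack('H2', $byte);    # high nybble first: 41
my $hex_low   = unpack('h2', $byte);    # low nybble first:  14

# Bytes are still consumed left to right; only the order inside each byte flips.
my $two_bytes = unpack('b16', 'AB');    # 'A' then 'B', each LSB first

print "$bits_desc $bits_asc $hex_high $hex_low $two_bytes\n";
```

With 'b', the string is walked from low index to high index while each byte is read from its lowest bit upward, which is the mixed ordering described.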

bulk88, Jun 05 '25 02:06