YAML-PP-p5 icon indicating copy to clipboard operation
YAML-PP-p5 copied to clipboard

Description and implementation of YAML::PP::Schema::Binary does not make sense

Open pali opened this issue 5 years ago • 4 comments

First description:

https://github.com/perlpunk/YAML-PP-p5/blob/78a05060ad2865ef84595c60d186f71dec85b254/lib/YAML/PP/Schema/Binary.pm#L77-L81

encode_base64() is a function: {0x00..0xFF}ᴺ →{0x2B, 0x2F, 0x30..0x39, 0x41..0x5A, 0x61..7A}ᴹ So YAML::PP::Schema::Binary obviously cannot take string which contains {0x000100..0x10FFFF}.

Binary data are defined as stream of octets, therefore from set {0x00..0xFF} like encode_base64() function takes it.

And next implementation:

https://github.com/perlpunk/YAML-PP-p5/blob/78a05060ad2865ef84595c60d186f71dec85b254/lib/YAML/PP/Schema/Binary.pm#L29-L39

unless ($binary =~ m/[\x{7F}-\x{10FFFF}]/)is equivalent to if ($binary =~ m/[\x{00}-\x{7E}]/) checks for all 7-bit ASCII characters except 7-bit ASCII \x{7F}. Comment says that this code is for ASCII which is not truth as it is ASCII ∖ {0x7F}.

Next there is check if (utf8::is_utf8($binary)) which in our case is basically: if ($binary !~ m/[\x00-\xFF]/ and (int(rand(2))%2 == 1 or $binary =~ m/[\x80-\xFF]/)) So this code always skips strings with values from set {0x000100..0x10FFFF} and then (pseudo)-randomly (depending on internal representation of scalar) also skips strings from set {0x80..0xFF}. Comment says that this code is for utf8, but it is not truth, see documentation and my above simplified implementation.

And finally it calls encode_base64 function. When this function is called? Strings with only {0x00..0x7E} are ignored by first ASCII check. Then by second checks are always ignored strings which have at least one value from {0x000100..0x10FFFF}. So encode_base64 is called when string contains at least one value 0x7F or when there is at least one value from set {0x80..0xFF} and is_utf8 (pseudo)-randomly returned false.

Suggested fix

  1. Change YAML::PP::Schema::Binary module to work only with binary data as name suggests. Therefore with strings which contains only values from set {0x00..0xFF}.

  2. Use encode_base64() always when input string is non-7-bit ASCII, therefore contains at least one value from set {0x80..0xFF}.

  3. Decide if Base64 encoding is really needed for strings with character 0x7F. It is 7-bit ASCII therefore in 7-bit applications it is not needed special treatment for it. But I'm not sure if YAML needs special treatment of 0x7F or not.

CC @2shortplanks Please look at it and ideally do some correction if I wrote some mistake. Some parts I simplified to make it more easier to understand.

pali avatar Jan 23 '20 16:01 pali