YAML-PP-p5
YAML-PP-p5 copied to clipboard
Description and implementation of YAML::PP::Schema::Binary does not make sense
First description:
https://github.com/perlpunk/YAML-PP-p5/blob/78a05060ad2865ef84595c60d186f71dec85b254/lib/YAML/PP/Schema/Binary.pm#L77-L81
encode_base64()
is a function: {0x00..0xFF}ᴺ →{0x2B, 0x2F, 0x30..0x39, 0x41..0x5A, 0x61..7A}ᴹ
So YAML::PP::Schema::Binary obviously cannot take string which contains {0x000100..0x10FFFF}.
Binary data are defined as stream of octets, therefore from set {0x00..0xFF} like encode_base64()
function takes it.
And next implementation:
https://github.com/perlpunk/YAML-PP-p5/blob/78a05060ad2865ef84595c60d186f71dec85b254/lib/YAML/PP/Schema/Binary.pm#L29-L39
unless ($binary =~ m/[\x{7F}-\x{10FFFF}]/)
is equivalent to if ($binary =~ m/[\x{00}-\x{7E}]/)
checks for all 7-bit ASCII characters except 7-bit ASCII \x{7F}
. Comment says that this code is for ASCII
which is not truth as it is ASCII
∖ {0x7F}.
Next there is check if (utf8::is_utf8($binary))
which in our case is basically:
if ($binary !~ m/[\x00-\xFF]/ and (int(rand(2))%2 == 1 or $binary =~ m/[\x80-\xFF]/))
So this code always skips strings with values from set {0x000100..0x10FFFF} and then (pseudo)-randomly (depending on internal representation of scalar) also skips strings from set {0x80..0xFF}. Comment says that this code is for utf8
, but it is not truth, see documentation and my above simplified implementation.
And finally it calls encode_base64
function. When this function is called? Strings with only {0x00..0x7E} are ignored by first ASCII
check. Then by second checks are always ignored strings which have at least one value from {0x000100..0x10FFFF}. So encode_base64
is called when string contains at least one value 0x7F
or when there is at least one value from set {0x80..0xFF} and is_utf8
(pseudo)-randomly returned false.
Suggested fix
-
Change YAML::PP::Schema::Binary module to work only with binary data as name suggests. Therefore with strings which contains only values from set {0x00..0xFF}.
-
Use encode_base64() always when input string is non-7-bit ASCII, therefore contains at least one value from set {0x80..0xFF}.
-
Decide if Base64 encoding is really needed for strings with character 0x7F. It is 7-bit ASCII therefore in 7-bit applications it is not needed special treatment for it. But I'm not sure if YAML needs special treatment of 0x7F or not.
CC @2shortplanks Please look at it and ideally do some correction if I wrote some mistake. Some parts I simplified to make it more easier to understand.