Peachpie: array key encoding gets corrupted

Peachpie:

$chr = chr(225); // 'á' in Latin1
$arr = [$chr => "dummy"];
$key = key($arr); // Should be $chr = chr(225)
assert(base64_encode($key) == base64_encode($chr)); // <---- Fails on Peachpie !!! (Shouldn't fail)
assert($key == $chr); // Works

Zend's PHP:

See: http://sandbox.onlinephpfunctions.com/code/69c96410ed13d1386885b9ed1f14781abde91f5b

Jul 24 '20 23:07 kripper

MutableStringBlob.ToString() (called in PhpValue::TryToIntStringKey) is corrupting Latin-1 encoded strings, because it is defined as public override string ToString() => ToString(Encoding.UTF8);. Bytes such as chr(225), chr(226), etc. are not valid UTF-8 on their own, so they are all decoded to the Unicode replacement character (U+FFFD, usually displayed as "?"), and the original data is lost.
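To make the data loss concrete, here is a minimal standalone sketch. It does not use Peachpie's types; the class name Utf8LossDemo is made up for illustration, and Encoding.UTF8.GetString stands in for what ToString(Encoding.UTF8) is assumed to do with the underlying bytes. It compares a UTF-8 round trip with an ISO-8859-1 round trip for the single byte 0xE1:

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of Peachpie.
static class Utf8LossDemo
{
    static void Main()
    {
        // chr(225) in PHP is the single byte 0xE1.
        byte[] original = { 0xE1 };

        // 0xE1 is a UTF-8 lead byte with no continuation bytes,
        // so the decoder substitutes U+FFFD for it.
        string viaUtf8 = Encoding.UTF8.GetString(original);
        byte[] backFromUtf8 = Encoding.UTF8.GetBytes(viaUtf8);

        // ISO-8859-1 maps every byte 0x00-0xFF to the code point
        // with the same value, so the round trip is lossless.
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        string viaLatin1 = latin1.GetString(original);
        byte[] backFromLatin1 = latin1.GetBytes(viaLatin1);

        Console.WriteLine(Convert.ToBase64String(original));       // 4Q==
        Console.WriteLine(Convert.ToBase64String(backFromUtf8));   // 77+9
        Console.WriteLine(Convert.ToBase64String(backFromLatin1)); // 4Q==
    }
}
```

The second line of output is the base64 of EF BF BD, the UTF-8 encoding of U+FFFD, which would explain why the base64_encode assertion fails. Why the plain == comparison still passes is not something this sketch shows; one guess is that both sides go through the same lossy conversion.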

What would be the best solution in this case? Maybe we can encode the binary MutableStringBlob into Unicode strings using something like:

static byte[] GetBytes(string str)
{
    byte[] bytes = new byte[str.Length * sizeof(char)];
    System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
    return bytes;
}

static string GetString(byte[] bytes)
{
    char[] chars = new char[bytes.Length / sizeof(char)];
    System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
    return new string(chars);
}

Plus, we would need to mark the IntStringKeys with an isBinary flag so that we can later decode the string back to a PhpString or PhpValue type.

Does it make sense?

Jul 26 '20 04:07 kripper

In general, ToString() should not be used; the compiler should pass the current Context.Encoding everywhere.

Jul 26 '20 13:07 jakubmisek

How does using Context.StringEncoding make it possible to use any kind of binary strings when certain bytes have no representation in certain encodings?

For example, we could have a PHP array with binary keys for doing some binary format decoding, and these binary keys could have no associated encoding, just raw bytes.

Jul 26 '20 14:07 kripper

@kripper right, in that case, it wouldn't help

Jul 26 '20 21:07 jakubmisek

Ok.

Does it make sense to encode the binaries to string using BlockCopy and set a flag on IntStringKey to mark it as a binary key?
In general, why aren't we just encoding all PHP strings into Unicode strings with BlockCopy (as raw, without a given Encoding)? Maybe the encoding should only be done when the strings are printed to screen, sent as a response in an ASP.NET page, etc., and also when reading strings defined in the source code into a PHP variable. Otherwise (when writing strings to files, network, etc.) there should be no encoding.

Jul 26 '20 21:07 kripper

It doesn't make any sense:

- unnecessary copying, unnecessarily complex key comparison
- 8-bit strings do not fit well into UTF-16 strings
- keys entered in an 8-bit encoding wouldn't be equal to the same text in UTF-16
- there is practically no open-source project that requires that (afaik MediaWiki has one library that uses binary array keys)
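The key-equality objection can be sketched as follows. This is only an illustration: PackBytes is a hypothetical helper modelled on the GetString function proposed earlier in the thread (two raw bytes reinterpreted as one UTF-16 code unit), and PackedKeyDemo is a made-up class name. Nothing here is Peachpie API.

```csharp
using System;

// Hypothetical demo class, not part of Peachpie.
static class PackedKeyDemo
{
    // Reinterpret raw bytes as UTF-16 code units, two bytes per char.
    // A trailing odd byte is silently dropped in this sketch.
    static string PackBytes(byte[] bytes)
    {
        char[] chars = new char[bytes.Length / sizeof(char)];
        Buffer.BlockCopy(bytes, 0, chars, 0, chars.Length * sizeof(char));
        return new string(chars);
    }

    static void Main()
    {
        byte[] latin1Key = { 0xE1, 0x62 };   // "áb" as Latin-1 bytes
        string packed = PackBytes(latin1Key);
        string unicodeKey = "\u00E1b";       // "áb" as a regular .NET string

        Console.WriteLine(packed.Length);        // 1
        Console.WriteLine(unicodeKey.Length);    // 2
        Console.WriteLine(packed == unicodeKey); // False
    }
}
```

The two bytes collapse into a single unrelated UTF-16 code unit, so a key packed this way never compares equal to the same text written as a normal string. Any lookup would need the isBinary flag plus a separate comparison path, and a key with an odd number of bytes (such as chr(225) alone) would need extra handling on top of that.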

This behavior is in general by design and on purpose, and it is a known compatibility gap. It lets us make use of the best of .NET, allows two-way interoperability, reduces memory and CPU usage significantly, and introduces Unicode into PHP ... we just have to find a reasonable balance.

Jul 26 '20 21:07 jakubmisek
