Array keys encoding gets corrupted
Peachpie:
$chr = chr(225); // 'á' in Latin1
$arr = [$chr => "dummy"];
$key = key($arr); // Should be $chr = chr(225)
assert(base64_encode($key) == base64_encode($chr)); // <---- Fails on Peachpie !!! (Shouldn't fail)
assert($key == $chr); // Works
Zend's PHP:
See: http://sandbox.onlinephpfunctions.com/code/69c96410ed13d1386885b9ed1f14781abde91f5b
MutableStringBlob.ToString() (called from PhpValue::TryToIntStringKey) corrupts Latin-1 encoded strings because it is defined as public override string ToString() => ToString(Encoding.UTF8);. Bytes such as chr(225), chr(226), etc. are not valid UTF-8 on their own, so they are all decoded to the Unicode replacement character (U+FFFD, usually displayed as "?"), and the original data is lost.
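The loss can be reproduced outside of Peachpie. The following is only a minimal sketch using System.Text.Encoding; it assumes nothing about the internals of MutableStringBlob, and the class and method names (Utf8LossDemo, RoundTrip) are made up for illustration.

```csharp
using System;
using System.Text;

public static class Utf8LossDemo
{
    // Decode raw bytes with the given encoding, then encode the string back.
    public static byte[] RoundTrip(byte[] raw, Encoding encoding)
        => encoding.GetBytes(encoding.GetString(raw));

    public static void Main()
    {
        byte[] raw = { 225 }; // the single byte produced by chr(225) in PHP

        // A lone 0xE1 byte is not valid UTF-8, so the decoder substitutes U+FFFD,
        // which encodes back to three different bytes.
        byte[] viaUtf8 = RoundTrip(raw, Encoding.UTF8);
        Console.WriteLine(BitConverter.ToString(viaUtf8)); // EF-BF-BD

        // ISO-8859-1 maps every byte value to a code point, so the byte survives.
        byte[] viaLatin1 = RoundTrip(raw, Encoding.GetEncoding("iso-8859-1"));
        Console.WriteLine(BitConverter.ToString(viaLatin1)); // E1
    }
}
```

This would explain why the base64_encode comparison in the report fails: the key no longer holds the byte 0xE1 but the three bytes of U+FFFD.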
What would be the best solution in this case? Maybe we can encode the binary MutableStringBlob into Unicode strings using something like:
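// Copies the UTF-16 code units of str into a byte array (2 bytes per char), with no encoding step.
static byte[] GetBytes(string str)
{
    byte[] bytes = new byte[str.Length * sizeof(char)];
    System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
    return bytes;
}

// Packs raw bytes into the UTF-16 code units of a string (2 bytes per char), with no decoding step.
static string GetString(byte[] bytes)
{
    char[] chars = new char[bytes.Length / sizeof(char)];
    System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
    return new string(chars);
}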
Plus, we would need to mark the IntStringKeys with an isBinary flag so that we can later decode the string back to a PhpString or PhpValue.
Does it make sense?
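As a rough check of the proposal, the two helpers can be exercised on raw bytes. This is only a sketch: the wrapper class BlockCopyDemo and its Main method are hypothetical, and the helpers are repeated so the example is self-contained. It suggests that even-length input round-trips, while odd-length input (such as the single byte from chr(225)) would need extra handling, because BlockCopy is asked to copy more bytes than the destination array holds.

```csharp
using System;

public static class BlockCopyDemo
{
    // Same helper as in the proposal: string -> raw UTF-16 code unit bytes.
    public static byte[] GetBytes(string str)
    {
        byte[] bytes = new byte[str.Length * sizeof(char)];
        Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
        return bytes;
    }

    // Same helper as in the proposal: raw bytes -> string, 2 bytes per char.
    public static string GetString(byte[] bytes)
    {
        char[] chars = new char[bytes.Length / sizeof(char)];
        Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
        return new string(chars);
    }

    public static void Main()
    {
        // Even length: both bytes are packed into one char and come back intact.
        byte[] back = GetBytes(GetString(new byte[] { 225, 226 }));
        Console.WriteLine(BitConverter.ToString(back)); // E1-E2

        // Odd length: the char array is one byte too small for the copy.
        try
        {
            GetString(new byte[] { 225 });
            Console.WriteLine("no exception");
        }
        catch (ArgumentException)
        {
            Console.WriteLine("ArgumentException for odd-length input");
        }
    }
}
```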
In general, ToString() should not be used; the compiler should pass the current Context.Encoding everywhere.
How does using Context.StringEncoding make it possible to use any kind of binary strings when certain bytes have no representation in certain encodings?
For example, we could have a PHP array with binary keys used for decoding some binary format, and these binary keys could have no associated encoding, just raw bytes.
@kripper right, in that case, it wouldn't help
Ok.
-
Does it make sense to encode binary data to strings using BlockCopy and set a flag on IntStringKey to mark it as a binary key?
-
In general, why aren't we just encoding all PHP strings into Unicode strings with BlockCopy (as raw bytes, without a given Encoding)? Maybe the encoding should only be applied when the strings are printed to the screen, sent as a response in an ASP.NET page, etc., and also when reading strings defined in the source code into a PHP variable. Otherwise (when writing strings to files, the network, etc.) there should be no encoding.
That doesn't make sense:
- unnecessary copying and unnecessarily complex key comparison
- 8-bit strings do not fit well into UTF-16 strings
- keys entered in an 8-bit encoding wouldn't be equal to the same text in UTF-16
- there is practically no open-source project that requires it (AFAIK MediaWiki has one library that uses binary array keys)
This behavior is, in general, by design, and it is a known compatibility gap. It lets us make use of the best of .NET with two-way interoperability, reduces memory and CPU usage significantly, and introduces Unicode into PHP. We just have to find a reasonable balance.
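The third objection in the list above (8-bit keys not matching the same text in UTF-16) can be illustrated with a small sketch. It assumes the BlockCopy packing proposed earlier in the thread; the names KeyEqualityDemo and PackRaw are hypothetical and not part of Peachpie.

```csharp
using System;
using System.Text;

public static class KeyEqualityDemo
{
    // Packs raw bytes two-per-char, as the BlockCopy proposal would.
    // Only even-length input is handled in this sketch.
    public static string PackRaw(byte[] bytes)
    {
        char[] chars = new char[bytes.Length / sizeof(char)];
        Buffer.BlockCopy(bytes, 0, chars, 0, chars.Length * sizeof(char));
        return new string(chars);
    }

    public static void Main()
    {
        // The key "ab" as it would arrive from an 8-bit PHP string.
        byte[] eightBit = Encoding.ASCII.GetBytes("ab");
        string packed = PackRaw(eightBit);

        // Two bytes collapse into a single UTF-16 code unit ...
        Console.WriteLine(packed.Length); // 1

        // ... so the packed key no longer equals the ordinary .NET string "ab".
        Console.WriteLine(packed == "ab"); // False
    }
}
```

Under this packing, an array key written as a .NET string and the same key read from 8-bit data would land in different slots unless every comparison unpacked one side first.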