ZnUTF8Encoder does not signal error when decoding #[16rF4 16r90 16r80 16r80]
In the following example, the #decodeBytes: message returns a String that #encodeString: cannot encode:
string := ZnCharacterEncoder utf8 decodeBytes: #[16rF4 16r90 16r80 16r80].
"⇒ string is equal to: (String with: (Character codePoint: 16r110000))"
ZnCharacterEncoder utf8 encodeString: string
"⇒ signals ZnInvalidUTF8: Character Unicode code point outside encoder range"
I think the #decodeBytes: message should already have signaled an error, as the byte sequence is not a well-formed UTF-8 byte sequence according to Table 3-7 on p. 124 in section ‘3.9 Unicode Encoding Forms’ of ‘The Unicode Standard, Version 14.0 – Core Specification’. As the first byte is F4, the second byte should be in the range 80..8F to be part of a well-formed sequence, but here it is 90.
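To make the rule concrete, here is a minimal sketch of the second-byte check for four-byte sequences. The block name secondByteIsWellFormed is hypothetical and not part of Zinc, and it only covers lead bytes F0..F4; the ranges are my reading of Table 3-7. The restriction after F4 exists because F4 90 80 80 would decode to U+110000, one past the last code point U+10FFFF.

```smalltalk
"Hypothetical helper block, not part of Zinc. Assumes the lead byte is in F0..F4."
| secondByteIsWellFormed |
secondByteIsWellFormed := [ :first :second |
	first = 16rF0
		ifTrue: [ "F0 requires 90..BF, which rules out overlong encodings"
			second between: 16r90 and: 16rBF ]
		ifFalse: [
			first = 16rF4
				ifTrue: [ "F4 requires 80..8F, which keeps the result at or below U+10FFFF"
					second between: 16r80 and: 16r8F ]
				ifFalse: [ "F1..F3 accept any continuation byte"
					second between: 16r80 and: 16rBF ] ] ].

secondByteIsWellFormed value: 16rF4 value: 16r90.
"⇒ false, the sequence from this issue"
secondByteIsWellFormed value: 16rF4 value: 16r8F.
"⇒ true, the start of the sequence for U+10FFFF"
```

A decoder would presumably apply a check like this before combining the bytes, so that the error is raised on the second byte rather than on the out-of-range code point.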
Thanks a lot for the feedback, the following commit should fix your issue:
https://github.com/svenvc/zinc/commit/37b7d0f83835520efbde74be65fef35ebc308b95
Can you check?
Looks OK, thanks!
BTW, maybe take a look at PTermUTF8Encoder; it might make sense to pull up the method #decodeStreamUpToIncomplete: to ZnUTF8Encoder. It decodes a stream of bytes, while offering two ‘conveniences’: substituting U+FFFD for invalid subsequences, and resetting the stream to the position before an incomplete subsequence at the end (if there is one). There’s a test in PTermUTF8EncoderTest. It’s used to handle output like in this example (from https://github.com/lxsang/PTerm/issues/28):
