
ZnUTF8Encoder does not signal error when decoding #[16rF4 16r90 16r80 16r80]

Open Rinzwind opened this issue 2 years ago • 2 comments

In the following example, the #decodeBytes: message returns a String that cannot be encoded by #encodeString::

string := ZnCharacterEncoder utf8 decodeBytes: #[16rF4 16r90 16r80 16r80].
"⇒ string is equal to: (String with: (Character codePoint: 16r110000))"

ZnCharacterEncoder utf8 encodeString: string
"⇒ signals ZnInvalidUTF8: Character Unicode code point outside encoder range"

I think the #decodeBytes: message should already have signaled an error, as the byte sequence is not a well-formed UTF-8 byte sequence according to table 3-7 on p. 124 in section ‘3.9 Unicode Encoding Forms’ of ‘The Unicode Standard, Version 14.0 – Core Specification’. As the first byte is F4, the second byte should be in the range 80..8F to be part of a well-formed sequence.
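To make the table 3-7 argument concrete, here is a minimal Playground sketch in plain Pharo. It does not use Zinc at all; `secondByteRanges` and `isSecondByteAllowed` are hypothetical names introduced only for illustration, and only the second-byte rule is modelled (lead bytes are assumed to be in C2..F4).

```smalltalk
"Sketch only: secondByteRanges and isSecondByteAllowed are hypothetical names, not part of Zinc."
| secondByteRanges isSecondByteAllowed |

"Lead bytes whose second byte is restricted by table 3-7;
 every other lead byte in C2..F4 allows the full range 80..BF."
secondByteRanges := Dictionary new.
secondByteRanges
	at: 16rE0 put: (16rA0 to: 16rBF);
	at: 16rED put: (16r80 to: 16r9F);
	at: 16rF0 put: (16r90 to: 16rBF);
	at: 16rF4 put: (16r80 to: 16r8F).

isSecondByteAllowed := [ :lead :second |
	(secondByteRanges at: lead ifAbsent: [ 16r80 to: 16rBF ]) includes: second ].

isSecondByteAllowed value: 16rF4 value: 16r90.
"⇒ false"

isSecondByteAllowed value: 16rF4 value: 16r8F.
"⇒ true"

"Assembling the payload bits without that check gives a value just past the last code point, 16r10FFFF."
((16rF4 bitAnd: 16r07) bitShift: 18)
	+ ((16r90 bitAnd: 16r3F) bitShift: 12)
	+ ((16r80 bitAnd: 16r3F) bitShift: 6)
	+ (16r80 bitAnd: 16r3F).
"⇒ 1114112, that is 16r110000"
```

The last expression shows where the code point 16r110000 in the decoded string comes from: the bit layout of a four-byte sequence can represent it, and only the range check on the second byte rules it out.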

Rinzwind, Jun 16 '22 19:06

Thanks a lot for the feedback. The following commit should fix your issue:

https://github.com/svenvc/zinc/commit/37b7d0f83835520efbde74be65fef35ebc308b95

Can you check?

svenvc, Jun 17 '22 10:06

Looks OK, thanks!

BTW, maybe take a look at PTermUTF8Encoder: it might make sense to pull up the method #decodeStreamUpToIncomplete: to ZnUTF8Encoder. It decodes a stream of bytes, while offering two ‘conveniences’: substituting U+FFFD for invalid subsequences, and resetting the stream to the position before an incomplete subsequence at the end (if there is one). There’s a test in PTermUTF8EncoderTest. It’s used to handle output like in this example (from https://github.com/lxsang/PTerm/issues/28):

[Image: Example PTermUTF8Encoder]
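For readers unfamiliar with the second ‘convenience’, the following Playground sketch shows one way to find an incomplete subsequence at the end of a chunk of bytes. It is not the PTerm or Zinc implementation: `incompleteTailSize` is a hypothetical block written for this illustration, it only counts bytes, and it does not check that the truncated prefix is itself well-formed.

```smalltalk
"Hypothetical helper, not PTerm or Zinc code: answer how many bytes at the end of a
 ByteArray form a truncated UTF-8 sequence, or 0 if the chunk does not end in one."
| incompleteTailSize |
incompleteTailSize := [ :bytes | | back result scanning byte expected |
	back := 1.
	result := 0.
	scanning := true.
	"Walk backwards over at most three bytes, skipping continuation bytes (80..BF)."
	[ scanning and: [ back <= (3 min: bytes size) ] ] whileTrue: [
		byte := bytes at: bytes size - back + 1.
		(byte between: 16r80 and: 16rBF)
			ifTrue: [ back := back + 1 ]
			ifFalse: [
				scanning := false.
				"A lead byte announces the sequence length; ASCII bytes end the scan."
				byte >= 16rC0 ifTrue: [
					expected := byte >= 16rF0
						ifTrue: [ 4 ]
						ifFalse: [ byte >= 16rE0 ifTrue: [ 3 ] ifFalse: [ 2 ] ].
					expected > back ifTrue: [ result := back ] ] ] ].
	result ].

incompleteTailSize value: #[16r61 16rE2 16r82].
"⇒ 2"

incompleteTailSize value: #[16rE2 16r82 16rAC].
"⇒ 0"
```

A caller reading from a terminal could decode everything except the last `incompleteTailSize` bytes and keep those for the next read, which has the same effect as resetting the stream to the position before the incomplete subsequence.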

Rinzwind, Jun 17 '22 19:06