Behaviour of ByteString.utf8() with bytes that aren't valid UTF-8
Is there a defined behaviour for ByteString.utf8(...) for sequences of bytes which are not valid UTF-8?
I've tried the following:
byte[] invalidUtf8 = {(byte) 0xFF}; // 0xFF will never appear in valid UTF-8
ByteString b = ByteString.of(invalidUtf8);
System.out.println(b.utf8()); // �
I was expecting an exception of some kind (though I could very well be incorrect about my "invalid" UTF-8). Is there a defined behaviour in this case?
The behavior is well-defined but not well-documented.
When converting UTF-8 bytes to a UTF-16 Java string ("decoding UTF-8"), we replace invalid sequences with the Unicode replacement character, �.
When converting UTF-16 chars to UTF-8 bytes ("encoding UTF-8"), we replace invalid UTF-16 sequences with the ASCII question mark, ?.
We didn't invent this policy; we just matched what Android (and the JVM) already did when we reimplemented this for better performance.
We should document this somewhere obvious.
Thanks for the quick reply!
Is it a safe assumption that if the result of ByteString.utf8() contains the unicode replacement character then the underlying bytes are not valid UTF-8?
Not really. The replacement character is a valid codepoint that can legitimately be part of a string. See for example @swankjesse's comment above.
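To illustrate: the replacement character U+FFFD has a perfectly valid three-byte UTF-8 encoding, so valid bytes can decode to a string containing �. A minimal sketch with plain Java APIs (Okio's decoder behaves the same way here):

```java
import java.nio.charset.StandardCharsets;

public class ReplacementCharIsValid {
    public static void main(String[] args) {
        // U+FFFD itself has a valid UTF-8 encoding: 0xEF 0xBF 0xBD.
        byte[] validUtf8 = {(byte) 0xEF, (byte) 0xBF, (byte) 0xBD};
        String decoded = new String(validUtf8, StandardCharsets.UTF_8);
        // The bytes were valid UTF-8, yet the decoded string contains �,
        // so the presence of � doesn't prove the input was malformed.
        System.out.println(decoded.equals("\uFFFD")); // true
    }
}
```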
We should document this somewhere obvious.
I've added some documentation here. Is this enough or would you want more?
You could add some examples of each case. An example of decoding UTF-8 can be seen in the issue description. An encoding UTF-8 example could be this:
With Java APIs:
char[] chars = {'\uD800', 'a'}; // Unpaired surrogate
String str = new String(chars);
System.out.println(str); // ?a
With Okio APIs:
char[] chars = {'\uD800', 'a'}; // Unpaired surrogate
ByteString byteString = ByteString.encodeString(new String(chars), Charset.forName("UTF-8"));
System.out.println(byteString.utf8()); // ?a
@petedmarsh One way to check whether the bytes are valid UTF-8 is using these Java APIs:
byte[] invalidUtf8 = {(byte) 0xFF}; // 0xFF will never appear in valid UTF-8
java.nio.ByteBuffer buffer = ByteBuffer.wrap(invalidUtf8);
try {
    Charset.forName("UTF-8").newDecoder().decode(buffer);
} catch (CharacterCodingException e) {
    e.printStackTrace(); // thrown for malformed input
}
Btw, I've noticed Guava has an isWellFormed method to check whether a byte[] is a well-formed UTF-8 byte sequence. Would it make sense to add this to Okio?
You can check with Okio by encoding and decoding and comparing. Valid UTF-8 will roundtrip without changes; invalid UTF-8 won't.
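A minimal sketch of that round-trip check, written here with plain java.nio APIs so it is self-contained (the Okio equivalent would compare ByteString.encodeUtf8(byteString.utf8()) against the original ByteString):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8RoundTrip {
    // Valid UTF-8 decodes and re-encodes to the same bytes; invalid sequences
    // are replaced with U+FFFD during decoding, so the re-encoded bytes differ.
    static boolean isValidUtf8(byte[] bytes) {
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        return Arrays.equals(decoded.getBytes(StandardCharsets.UTF_8), bytes);
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8("hello".getBytes(StandardCharsets.UTF_8))); // true
        System.out.println(isValidUtf8(new byte[] {(byte) 0xFF}));                 // false
    }
}
```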