jackson-core
`UTF8JsonGenerator` writes supplementary characters as a surrogate pair -- should use 4-byte encoding
When outputting a string value containing a supplementary Unicode code point, UTF8JsonGenerator encodes the supplementary character as a pair of \uNNNN escapes (the two halves of the surrogate pair that would denote the code point in UTF-16) instead of using the correct multi-byte UTF-8 encoding of the character. The following Groovy script demonstrates the behaviour:
@Grab(group='com.fasterxml.jackson.core', module='jackson-core', version='2.6.2')
import com.fasterxml.jackson.core.JsonFactory
def factory = new JsonFactory()
def bytes1 = new ByteArrayOutputStream()
def gen1 = factory.createGenerator(bytes1) // UTF8JsonGenerator
gen1.writeStartObject()
gen1.writeStringField("test", new String(Character.toChars(0x1F602)))
gen1.writeEndObject()
gen1.close()
System.out.write(bytes1.toByteArray())
println ""
// prints {"test":"\uD83D\uDE02"}
def bytes2 = new ByteArrayOutputStream()
new OutputStreamWriter(bytes2, "UTF-8").withWriter { w ->
def gen2 = factory.createGenerator(w) // WriterBasedJsonGenerator
gen2.writeStartObject()
gen2.writeStringField("test", new String(Character.toChars(0x1F602)))
gen2.writeEndObject()
gen2.close()
}
System.out.write(bytes2.toByteArray())
println ""
// prints {"test":"😂"}
When generating to a Writer rather than an OutputStream (and letting Java handle the UTF-8 byte conversion), the supplementary character U+1F602 is encoded as the correct four-byte UTF-8 sequence f0 9f 98 82.
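A quick JDK-only way to confirm that byte sequence (a minimal sketch, not using Jackson at all):
String emoji = new String(Character.toChars(0x1F602));
// dump the native UTF-8 bytes of U+1F602; prints: f0 9f 98 82
for (byte b : emoji.getBytes(java.nio.charset.StandardCharsets.UTF_8)) {
    System.out.printf("%02x ", b);
}
System.out.println();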
Yes. Unfortunately this is how JSON specification mandates escaping of these characters:
"To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve character sequence, encoding the UTF-16 surrogate pair. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E". "
(Section 9, "String", http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf -- same as what earlier JSON specifications have said)
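For reference, the surrogate pair in the quoted example follows from the standard UTF-16 arithmetic; a minimal JDK-only sketch:
int cp = 0x1D11E; // the G clef character from the spec's example
char high = Character.highSurrogate(cp); // 0xD834
char low = Character.lowSurrogate(cp);   // 0xDD1E
// prints \uD834\uDD1E, matching the escaped form quoted above
System.out.printf("\\u%04X\\u%04X%n", (int) high, (int) low);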
So although the native UTF-8 representation would use a 4-byte sequence (and one that I personally agree would be the obvious correct choice), my understanding is that the JSON specification requires different handling. If there are other interpretations or specifications wrt this issue I would be interested in those.
I would be open to adding a JsonGenerator.Feature that would allow the more natural UTF-8 encoding to be used.
True, on closer reading that is how the spec requires them to be escaped if you choose to escape them, but supplementary characters are not required to be escaped so an option to control it would be reasonable.
From that same section of the spec:
All characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark (U+0022), reverse solidus (U+005C), and the control characters U+0000 to U+001F.
Hmmh. Ok, fair enough. I am not a fan of using somewhat broken escaping anyway...
So it does seem like output could be changed.
But just in case some code out there would find the change unpalatable (it may seem unlikely, but there always tends to be some user somewhere who reports problems), I think this needs to go in 2.7. I could then add a JsonGenerator.Feature, but default it so that native UTF-8 encoding is used unless the feature is changed to force escaping.
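Purely as an illustration of what that could look like from the caller's side (the feature name below is hypothetical, not an existing JsonGenerator.Feature constant):
JsonFactory factory = new JsonFactory();
ByteArrayOutputStream out = new ByteArrayOutputStream();
JsonGenerator gen = factory.createGenerator(out);
// Proposed default: supplementary characters written as native 4-byte UTF-8.
// Forcing the old surrogate-pair escaping would then be an explicit opt-in, e.g.:
// gen.disable(JsonGenerator.Feature.WRITE_SURROGATES_AS_UTF8); // hypothetical name
gen.writeStartObject();
gen.writeStringField("test", new String(Character.toChars(0x1F602)));
gen.writeEndObject();
gen.close();
System.out.println(out.toString("UTF-8")); // with the proposed default: {"test":"😂"}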
Thank you for reporting this!
That sounds perfectly fair. Both forms would parse to the same result for a spec-compliant parser so it's definitely not a critical bug.
Quick note: hoping to fix this before 2.7.0-rc1 goes out, which is to happen soon (ideally within a week or so but we'll see).
Ok. Turns out that the fix is not quite as easy as I had hoped. I forgot that the thing that makes this complex is the requirement to have access to 2 chars instead of a single one; and that requires propagation of input as well as a return value to indicate an "extra" character getting consumed. So I added failing tests for the handling, but have not been able to improve the code.
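To make the difficulty concrete, here is a rough sketch of the shape such handling needs (illustrative only, not UTF8JsonGenerator's actual internals): the encoder must be able to peek at the next char, and must report back how many chars it consumed:
// Illustrative sketch: encodes one logical character starting at c1, using c2 when
// c1 opens a surrogate pair, and returns how many input chars (1 or 2) were consumed.
static int writeUtf8Char(java.io.OutputStream out, char c1, char c2) throws java.io.IOException {
    if (Character.isHighSurrogate(c1) && Character.isLowSurrogate(c2)) {
        int cp = Character.toCodePoint(c1, c2);
        out.write(0xF0 | (cp >> 18));          // 4-byte UTF-8 form for U+10000..U+10FFFF
        out.write(0x80 | ((cp >> 12) & 0x3F));
        out.write(0x80 | ((cp >> 6) & 0x3F));
        out.write(0x80 | (cp & 0x3F));
        return 2; // the "extra" char was consumed
    }
    // BMP handling elided; the real generator also deals with 1-, 2- and 3-byte forms
    out.write(String.valueOf(c1).getBytes(java.nio.charset.StandardCharsets.UTF_8));
    return 1;
}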
Sort of related, #307.
As already discussed above, the ECMA specification allows (but does not mandate) using \uHHHH escaping for Unicode characters (including ones that are represented with surrogate pairs in UTF-16).
Note that using \uHHHH escaping, though correct and valid, has 2 big deficiencies (a quick size check follows the list):
- it bloats the binary size of the serialized JSON: the surrogate pair discussed here takes 12 bytes to represent in escaped form, while it is actually a single Unicode code point, which requires only 4 bytes natively in UTF-8
- the serialized JSON is completely unreadable, which defeats one of its main advantages
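A quick size check of the two representations (JDK only):
String emoji = new String(Character.toChars(0x1F602));
System.out.println("\\uD83D\\uDE02".length());                                       // 12: length of the escaped form
System.out.println(emoji.getBytes(java.nio.charset.StandardCharsets.UTF_8).length);  // 4: native UTF-8 bytes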
To me, the \uHHHH support in JSON serves to escape characters that are not representable in the output encoding (say ASCII). This has no meaningful use nowadays, given that JSON should be UTF-8 and SHALL be UTF-8/16/32 according to the latest RFC: http://www.rfc-editor.org/rfc/rfc7159.txt
All that said, I do think the option to use \uHHHH is required and this option must default to not using such escaping.
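Incidentally, for the ASCII-only scenario mentioned above, jackson-core already offers JsonGenerator.Feature.ESCAPE_NON_ASCII, which forces \uXXXX escapes for every code point outside 7-bit ASCII; a minimal sketch (the exact output shown is indicative):
JsonFactory factory = new JsonFactory();
ByteArrayOutputStream out = new ByteArrayOutputStream();
JsonGenerator gen = factory.createGenerator(out);
gen.enable(JsonGenerator.Feature.ESCAPE_NON_ASCII); // escape everything >= U+0080
gen.writeStartObject();
gen.writeStringField("test", "café " + new String(Character.toChars(0x1F602)));
gen.writeEndObject();
gen.close();
System.out.println(out.toString("UTF-8"));
// indicative output: {"test":"caf\u00E9 \uD83D\uDE02"}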
Hi, is there any chance that this issue will be fixed soon? Robert
@fiserro I haven't had time to work on this and have nothing planned.
hi @cowtowncoder , Is this still a work in progress?
@rkinabhi It is something that would be nice to resolve but I am not actively working on it at this point (I try to add the active label on things I do work on).
To print an emoji while using a ByteArrayOutputStream: it can be written using the writeRawValue(cbuf, offset, len) method.
import java.io.ByteArrayOutputStream;
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;

String emoji = new String(Character.toChars(0x1F602));
char[] charArray = String.format("\"%s\"", emoji).toCharArray();
JsonFactory factory = new JsonFactory();
ByteArrayOutputStream bytes3 = new ByteArrayOutputStream();
JsonGenerator gen3 = factory.createGenerator(bytes3); // UTF8JsonGenerator
gen3.writeStartObject();
// instead of writeStringField("test", emoji), write the name and then the raw, pre-quoted value:
gen3.writeFieldName("test");
gen3.writeRawValue(charArray, 0, charArray.length);
gen3.writeEndObject();
gen3.close();
System.out.write(bytes3.toByteArray());
System.out.println(new String(bytes3.toByteArray()));
// prints {"test":"😂"}
It can be written using the writeRawValue(cbuf, offset, len) method.
That's all very well for values you completely control, but you would need to manually take care of escaping everything else apart from the supplementary characters according to JSON rules (\ => \\, " => \", newline => \n, etc.).
I fixed the JSON rules you mentioned by adding JSON escape processing:
// JsonStringEncoder (com.fasterxml.jackson.core.io.JsonStringEncoder) applies the JSON string escaping rules
String emoji4 = new String(Character.toChars(0x1F602));
char[] cEmoji4 = JsonStringEncoder.getInstance().quoteAsString(String.format("\\\"\n{%s}", emoji4));
char[] charArray4 = new char[cEmoji4.length + 2];
System.arraycopy(cEmoji4, 0, charArray4, 1, cEmoji4.length);
charArray4[0] = '"';
charArray4[charArray4.length - 1] = '"';
ByteArrayOutputStream bytes4 = new ByteArrayOutputStream();
JsonGenerator gen4 = factory.createGenerator(bytes4); // UTF8JsonGenerator
gen4.writeStartObject();
gen4.writeFieldName("test");
gen4.writeRawValue(charArray4, 0, charArray4.length);
gen4.writeEndObject();
gen4.close();
System.out.write(bytes4.toByteArray());
System.out.println(new String(bytes4.toByteArray()));
// prints {"test":"\\\"\n{😂}"}{"test":"\\\"\n{😂}"}