jackson-core `UTF8JsonGenerator` writes supplementary characters as a surrogate pair -- should use 4-byte encoding

When writing a string value that contains a supplementary Unicode code point, UTF8JsonGenerator encodes the character as a pair of \uNNNN escapes (the two halves of the surrogate pair that would denote the code point in UTF-16) instead of using the correct multi-byte UTF-8 encoding of the character. The following Groovy script demonstrates the behaviour:

@Grab(group='com.fasterxml.jackson.core', module='jackson-core', version='2.6.2')
import com.fasterxml.jackson.core.JsonFactory

def factory = new JsonFactory()
def bytes1 = new ByteArrayOutputStream()
def gen1 = factory.createGenerator(bytes1) // UTF8JsonGenerator
gen1.writeStartObject()
gen1.writeStringField("test", new String(Character.toChars(0x1F602)))
gen1.writeEndObject()
gen1.close()
System.out.write(bytes1.toByteArray())
println ""
// prints {"test":"\uD83D\uDE02"}


def bytes2 = new ByteArrayOutputStream()
new OutputStreamWriter(bytes2, "UTF-8").withWriter { w ->
  def gen2 = factory.createGenerator(w) // WriterBasedJsonGenerator
  gen2.writeStartObject()
  gen2.writeStringField("test", new String(Character.toChars(0x1F602)))
  gen2.writeEndObject()
  gen2.close()
}
System.out.write(bytes2.toByteArray())
println ""
// prints {"test":"😂"}

When generating to a Writer rather than an OutputStream (and letting Java handle the UTF-8 byte conversion), the supplementary character U+1F602 is encoded as the correct four-byte UTF-8 sequence f0 9f 98 82.
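
The two representations can be compared with the JDK alone, without Jackson. The following is a minimal sketch; the class and method names (`SupplementaryEncodingDemo`, `hex`, `utf16Escapes`) are hypothetical and chosen only for illustration.

```java
import java.nio.charset.StandardCharsets;

final class SupplementaryEncodingDemo {
    // Formats bytes as space-separated lowercase hex, e.g. "f0 9f 98 82".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // Renders every UTF-16 code unit of s as a backslash-u escape with four hex digits.
    static String utf16Escapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("\\u%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F602));
        // Escaped form: two escapes, one per surrogate half (D83D and DE02).
        System.out.println(utf16Escapes(s));
        // Native form: the four UTF-8 bytes f0 9f 98 82.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));
    }
}
```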

Oct 11 '15 20:10 ianroberts

Yes. Unfortunately this is how the JSON specification mandates escaping of these characters:

"To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve character sequence, encoding the UTF-16 surrogate pair. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E". "

(Section 9, "String", http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf -- same as what earlier JSON specifications have said)

So although the native UTF-8 representation would use a 4-byte sequence (one that I personally agree would be the obvious correct choice), my understanding is that the JSON specification requires different handling. If there are other interpretations or specifications regarding this issue, I would be interested in those.

I would be open to adding a JsonGenerator.Feature that would allow the more natural UTF-8 encoding to be used.

Oct 12 '15 02:10 cowtowncoder

True, on closer reading that is how the spec requires them to be escaped if you choose to escape them, but supplementary characters are not required to be escaped, so an option to control it would be reasonable.

Oct 12 '15 08:10 ianroberts

From that same section of the spec:

> All characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark (U+0022), reverse solidus (U+005C), and the control characters U+0000 to U+001F.
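
Read literally, that rule means an encoder only has to escape three classes of character and may pass everything else through, supplementary characters included. A minimal sketch of such an escaper follows; `MinimalJsonEscaper` and `quote` are hypothetical names, and this is not how Jackson implements escaping.

```java
final class MinimalJsonEscaper {
    // Escapes only the mandatory characters: quotation mark, reverse solidus,
    // and control characters U+0000 to U+001F. Everything else is copied as-is.
    static String quote(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 2);
        sb.append('"');
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                // Always use the six-character form for control characters to keep
                // the sketch short; the two-character short forms are also valid JSON.
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                // Includes both halves of a surrogate pair, left unescaped.
                sb.append(c);
            }
        }
        sb.append('"');
        return sb.toString();
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F602));
        System.out.println(quote("say \"hi\" " + emoji));
    }
}
```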

Oct 12 '15 08:10 ianroberts

Hmmh. Ok, fair enough. I am not a fan of using somewhat broken escaping anyway... So it does seem like the output could be changed. But just in case some code out there would find the change unpalatable (it may seem unlikely, but there always tends to be some user somewhere who does report problems), I think this needs to go in 2.7. I could then add a JsonGenerator.Feature, but default it so that native UTF-8 encoding is used unless the feature is changed to force escaping.

Thank you for reporting this!

Oct 12 '15 16:10 cowtowncoder

That sounds perfectly fair. Both forms would parse to the same result for a spec-compliant parser so it's definitely not a critical bug.

Oct 12 '15 17:10 ianroberts

Quick note: hoping to fix this before 2.7.0-rc1 goes out, which is to happen soon (ideally within a week or so but we'll see).

Nov 16 '15 04:11 cowtowncoder

Ok. Turns out that the fix is not quite as easy as I had hoped. I forgot that the thing that makes this complex is the requirement to have access to 2 chars instead of a single one; that requires propagating the input as well as a return value to indicate that an "extra" character was consumed. So I added failing tests for this handling, but have not been able to improve the code.
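
The shape of that problem can be sketched in isolation: the per-character routine needs the input array and its end offset so it can look ahead one char, and it has to report how many chars it consumed. The code below is a standalone JDK-only illustration, not Jackson's internal code; `SurrogateAwareUtf8`, `writeCodePoint` and `encode` are hypothetical names, JSON escaping is left out entirely, and replacing an unpaired surrogate with U+FFFD is an assumption made for this sketch.

```java
import java.io.ByteArrayOutputStream;

final class SurrogateAwareUtf8 {
    /**
     * Encodes the code point that starts at text[i] as UTF-8 and returns how
     * many chars were consumed: 2 for a valid surrogate pair, otherwise 1.
     */
    static int writeCodePoint(char[] text, int i, int end, ByteArrayOutputStream out) {
        int c = text[i];
        int consumed = 1;
        if (Character.isHighSurrogate(text[i]) && i + 1 < end
                && Character.isLowSurrogate(text[i + 1])) {
            // Look ahead one char and combine both halves into one code point.
            c = Character.toCodePoint(text[i], text[i + 1]);
            consumed = 2;
        } else if (Character.isSurrogate(text[i])) {
            c = 0xFFFD; // assumption for this sketch: replace an unpaired surrogate
        }
        if (c < 0x80) {
            out.write(c);
        } else if (c < 0x800) {
            out.write(0xC0 | (c >> 6));
            out.write(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            out.write(0xE0 | (c >> 12));
            out.write(0x80 | ((c >> 6) & 0x3F));
            out.write(0x80 | (c & 0x3F));
        } else {
            // Supplementary code point: four-byte sequence.
            out.write(0xF0 | (c >> 18));
            out.write(0x80 | ((c >> 12) & 0x3F));
            out.write(0x80 | ((c >> 6) & 0x3F));
            out.write(0x80 | (c & 0x3F));
        }
        return consumed;
    }

    // Drives the loop: the caller must advance by the returned count, not by 1.
    static byte[] encode(String s) {
        char[] text = s.toCharArray();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < text.length) {
            i += writeCodePoint(text, i, text.length, out);
        }
        return out.toByteArray();
    }
}
```

The caller-side change is the `i += ...` line: a loop that previously advanced one char at a time now depends on the callee's return value, which is the propagation described above.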

Nov 24 '15 06:11 cowtowncoder

Sort of related, #307.

Aug 11 '16 05:08 cowtowncoder

As already discussed above, the ECMA specification allows (but does not mandate) using \uHHHH escaping for Unicode characters (including ones that are represented with surrogate pairs in UTF-16).

Note that using \uHHHH, though correct and valid, has 2 big deficiencies:

- it bloats the binary size of the serialized JSON: the surrogate pair discussed here takes 12 bytes when escaped, while it is actually a single Unicode code point that needs only 4 bytes when represented natively in UTF-8
- the serialized JSON is completely unreadable, which defeats one of JSON's main advantages
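
The size difference is easy to check with the JDK alone. A minimal sketch (`EscapedSizeComparison` and `utf8Length` are hypothetical names):

```java
import java.nio.charset.StandardCharsets;

final class EscapedSizeComparison {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // The escaped text as it appears in the JSON output: 12 ASCII characters.
        String escaped = "\\uD83D\\uDE02";
        // The same code point (U+1F602) written natively.
        String raw = new String(Character.toChars(0x1F602));
        System.out.println("escaped: " + utf8Length(escaped) + " bytes"); // escaped: 12 bytes
        System.out.println("native:  " + utf8Length(raw) + " bytes");     // native:  4 bytes
    }
}
```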

To me, the \uHHHH support in JSON serves to escape characters that are not representable in the encoding being used (say, ASCII). It has no meaningful use now that JSON should be UTF-8 and SHALL be UTF-8/16/32 according to the latest RFC: http://www.rfc-editor.org/rfc/rfc7159.txt

All that said, I do think the option to use \uHHHH is required and this option must default to not using such escaping.

Nov 21 '16 10:11 mtsvetanov

Hi, is there any chance that this issue will be fixed soon? Robert

Apr 20 '17 09:04 fiserro

@fiserro I haven't had time to work on this and have nothing planned.

Apr 20 '17 15:04 cowtowncoder

Hi @cowtowncoder, is this still a work in progress?

Nov 12 '19 11:11 abhijeethp

@rkinabhi It is something that would be nice to resolve, but I am not actively working on it at this point (I try to add the active label to things I do work on).

Nov 12 '19 22:11 cowtowncoder

To print an emoji when writing to a ByteArrayOutputStream, it can be printed using the writeRawValue(cbuf, offset, len) function.

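import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

String emoji = new String(Character.toChars(0x1F602));
// Wrap the emoji in quotes by hand: writeRawValue() performs no quoting or escaping
char[] charArray = String.format("\"%s\"", emoji).toCharArray();
JsonFactory factory = new JsonFactory();
ByteArrayOutputStream bytes3 = new ByteArrayOutputStream();
JsonGenerator gen3 = factory.createGenerator(bytes3); // UTF8JsonGenerator
gen3.writeStartObject();
// used instead of gen3.writeStringField("test", emoji)
gen3.writeFieldName("test");
gen3.writeRawValue(charArray, 0, charArray.length);
gen3.writeEndObject();
gen3.close();
System.out.write(bytes3.toByteArray()); // raw UTF-8 bytes, no trailing newline
// decode explicitly as UTF-8 rather than with the platform default charset
System.out.println(new String(bytes3.toByteArray(), StandardCharsets.UTF_8));
// prints {"test":"😂"}{"test":"😂"}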

Jan 10 '24 08:01 gymnopedy01

> It can be printed using the writeRawValue(cbuf, offset, len) function.

That's all very well for values you completely control, but you would need to manually take care of escaping everything else apart from the supplementary characters according to JSON rules (\ => \\, " => \", newline => \n, etc. etc.).

Jan 10 '24 11:01 ianroberts

> That's all very well for values you completely control, but you would need to manually take care of escaping everything else apart from the supplementary characters according to JSON rules (\ => \\, " => \", newline => \n, etc. etc.).

I have addressed the JSON escaping rules you mentioned.

With JSON escaping added:

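import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.core.io.JsonStringEncoder;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

JsonFactory factory = new JsonFactory();
String emoji4 = new String(Character.toChars(0x1F602));
// quoteAsString() applies the JSON escaping rules but adds no surrounding quotes
char[] cEmoji4 = JsonStringEncoder.getInstance().quoteAsString(String.format("\\\"\n{%s}", emoji4));
// copy into a buffer with room for the opening and closing quotation marks
char[] charArray4 = new char[cEmoji4.length + 2];
System.arraycopy(cEmoji4, 0, charArray4, 1, cEmoji4.length);
charArray4[0] = '"';
charArray4[charArray4.length - 1] = '"';

ByteArrayOutputStream bytes4 = new ByteArrayOutputStream();
JsonGenerator gen4 = factory.createGenerator(bytes4); // UTF8JsonGenerator
gen4.writeStartObject();
gen4.writeFieldName("test");
gen4.writeRawValue(charArray4, 0, charArray4.length);
gen4.writeEndObject();
gen4.close();

System.out.write(bytes4.toByteArray()); // raw UTF-8 bytes, no trailing newline
// decode explicitly as UTF-8 rather than with the platform default charset
System.out.println(new String(bytes4.toByteArray(), StandardCharsets.UTF_8));

// prints {"test":"\\\"\n{😂}"}{"test":"\\\"\n{😂}"}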

Jan 11 '24 10:01 gymnopedy01
