jackson-core
Improve performance of writing raw UTF-8 encoded byte arrays
The output escape table covers just 7 bits, meaning that a raw UTF-8 byte cannot be used to index into the table without a branch test for negative bytes (i.e. bytes with an unsigned value above 0x7F, which Java treats as negative). This extra check occurs in a tight loop and can be avoided if the lookup table were to cover all 8-bit indices.
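For illustration, here is a simplified sketch of the difference (hypothetical helper names, not the actual jackson-core loop):

```java
// Simplified sketch, not the actual jackson-core code: contrasts indexing a
// 128-entry escape table (needs a sign check, since Java bytes are signed)
// with a 256-entry table (every byte value maps directly).
class EscapeScanSketch {
    // 7-bit table: bytes >= 0x80 are negative, so they must be filtered out
    // before indexing -- an extra branch on every iteration of the hot loop.
    static boolean needsEscape7Bit(byte[] utf8, int[] escCodes128) {
        for (byte b : utf8) {
            if (b >= 0 && escCodes128[b] != 0) {
                return true;
            }
        }
        return false;
    }

    // 8-bit table: the unsigned value 0x00-0xFF is always a valid index,
    // so the sign check disappears from the loop body.
    static boolean needsEscape8Bit(byte[] utf8, int[] escCodes256) {
        for (byte b : utf8) {
            if (escCodes256[b & 0xFF] != 0) {
                return true;
            }
        }
        return false;
    }
}
```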
This commit introduces ad-hoc logic in UTF8JsonGenerator#writeUTF8String to create an extended copy of _outputEscapes if necessary, writing the copy back into the field to avoid having to compute it again within the same generator instance (unless it is changed). This ad-hoc strategy was chosen because it is the least disruptive to existing code; a larger-scale change around CharacterEscapes would impact the public API or otherwise introduce subtle chances for breakage.
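Conceptually the extension step amounts to something like the following sketch (hedged; the actual helper in the PR may differ in detail):

```java
import java.util.Arrays;

// Hedged sketch of extending a 128-entry escape table to 256 entries.
// The PR caches the result back into _outputEscapes; only the copy step is
// shown here. Entries 0x80-0xFF stay 0 because raw UTF-8 lead and
// continuation bytes are never escaped and can be copied through verbatim.
final class EscapeTableExtension {
    static int[] extendTo8Bits(int[] escapes) {
        if (escapes.length >= 256) {
            return escapes; // already wide enough, nothing to do
        }
        return Arrays.copyOf(escapes, 256);
    }
}
```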
Some quick-and-dirty JMH tests on an M1 Max (arm64 ⚠) with Azul Zulu JDK 21 show the following numbers:
| Benchmark | (length) | (needEscape) | (optimized) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|---|---|
| JmhTest.writeUtf8String | 32 | first | true | thrpt | 40 | 32,156 | ± 0,084 | ops/us |
| JmhTest.writeUtf8String | 32 | first | false | thrpt | 40 | 27,936 | ± 0,106 | ops/us |
| JmhTest.writeUtf8String | 32 | last | true | thrpt | 40 | 33,049 | ± 0,091 | ops/us |
| JmhTest.writeUtf8String | 32 | last | false | thrpt | 40 | 29,605 | ± 0,102 | ops/us |
| JmhTest.writeUtf8String | 32 | none | true | thrpt | 40 | 32,922 | ± 0,192 | ops/us |
| JmhTest.writeUtf8String | 32 | none | false | thrpt | 40 | 29,654 | ± 0,074 | ops/us |
| JmhTest.writeUtf8String | 256 | first | true | thrpt | 40 | 6,350 | ± 0,023 | ops/us |
| JmhTest.writeUtf8String | 256 | first | false | thrpt | 40 | 4,734 | ± 0,012 | ops/us |
| JmhTest.writeUtf8String | 256 | last | true | thrpt | 40 | 6,399 | ± 0,018 | ops/us |
| JmhTest.writeUtf8String | 256 | last | false | thrpt | 40 | 4,759 | ± 0,017 | ops/us |
| JmhTest.writeUtf8String | 256 | none | true | thrpt | 40 | 6,402 | ± 0,021 | ops/us |
| JmhTest.writeUtf8String | 256 | none | false | thrpt | 40 | 4,751 | ± 0,025 | ops/us |
| JmhTest.writeUtf8String | 512 | first | true | thrpt | 40 | 3,215 | ± 0,030 | ops/us |
| JmhTest.writeUtf8String | 512 | first | false | thrpt | 40 | 2,478 | ± 0,008 | ops/us |
| JmhTest.writeUtf8String | 512 | last | true | thrpt | 40 | 3,259 | ± 0,012 | ops/us |
| JmhTest.writeUtf8String | 512 | last | false | thrpt | 40 | 2,480 | ± 0,026 | ops/us |
| JmhTest.writeUtf8String | 512 | none | true | thrpt | 40 | 3,262 | ± 0,013 | ops/us |
| JmhTest.writeUtf8String | 512 | none | false | thrpt | 40 | 2,486 | ± 0,007 | ops/us |
The benchmark writes buffers of length (length) filled with 'a' in all positions; for (needEscape) 'first' the first byte is overwritten with '"', for 'last' the last byte, and for 'none' the buffer is left as a sequence in which no escapes need to be inserted.
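Roughly, the benchmark inputs would be built like this (an illustrative reconstruction, not the actual JMH setup):

```java
// Illustrative reconstruction of the benchmark input buffers.
// 'length' and 'needEscape' correspond to the JMH parameters above.
static byte[] buildInput(int length, String needEscape) {
    byte[] buf = new byte[length];
    java.util.Arrays.fill(buf, (byte) 'a');
    if ("first".equals(needEscape)) {
        buf[0] = (byte) '"';            // escape required at the very start
    } else if ("last".equals(needEscape)) {
        buf[length - 1] = (byte) '"';   // escape required at the very end
    }
    // "none": all 'a', nothing needs escaping
    return buf;
}
```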
Overall the numbers show improvements in the range of 11%–33%. I wonder whether this extends to other CPU architectures; I'm opening this PR to gauge interest in such a change. Note that this only affects UTF8JsonGenerator#writeUTF8String, which isn't typically used, as it's more common to process from char[] or String buffers. In my use case I already have a UTF-8 encoded byte[], which prompted me to look into this.
This logic can probably be vectorized quite nicely; that is also done in .NET's JSON writer infrastructure.
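For reference, a speculative sketch of what a vectorized scan could look like with the JDK's incubating Vector API (not part of this PR; class and method names are made up, and it needs --add-modules jdk.incubator.vector):

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Speculative sketch: scan raw UTF-8 for bytes that need escaping
// ('"', '\\', or control characters below 0x20) a whole vector at a time.
final class VectorizedEscapeScan {
    private static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

    static boolean needsEscape(byte[] utf8) {
        int i = 0;
        int bound = SPECIES.loopBound(utf8.length);
        for (; i < bound; i += SPECIES.length()) {
            ByteVector v = ByteVector.fromArray(SPECIES, utf8, i);
            VectorMask<Byte> m = v.compare(VectorOperators.EQ, (byte) '"')
                    .or(v.compare(VectorOperators.EQ, (byte) '\\'))
                    // unsigned compare so 0x80-0xFF (raw UTF-8) is not flagged
                    .or(v.compare(VectorOperators.UNSIGNED_LT, (byte) 0x20));
            if (m.anyTrue()) {
                return true;
            }
        }
        for (; i < utf8.length; i++) { // scalar tail
            int b = utf8[i] & 0xFF;
            if (b == '"' || b == '\\' || b < 0x20) {
                return true;
            }
        }
        return false;
    }
}
```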
Thanks @JoostK. Would you be able to add the benchmark to https://github.com/FasterXML/jackson-core/tree/2.19/src/test/java/perf ?
Sure, I can add some; while looking at the existing ones I wonder what the desired testing strategy is:
1. extract both write loops to be able to compare the prior state (7-bit LUT) against the new state (8-bit LUT), or
2. call into `JsonGenerator#writeUTF8String` and then run the test with and without the change applied, possibly adding `char[]` writing as a comparative benchmark.

What is the most valuable thing to have here? Option 1 is meaningful for comparing this particular change across machines/JVMs, but option 2 is more valuable for measuring and comparing JsonGenerator write performance going forward.
Both sound useful - could you add both benchmarks?
I'll come up with something, probably over the coming days.
Accidentally rebased onto master, unaware that this PR was targeting 2.19. Reverted back to 2.19.
Here are the results on my MBP w/ M1 Max:
after:
Length 8, none escape: (7-bit, 8-bit, JsonGenerator): 57,4 / 47,8 / 196,7 msecs
Length 8, start escape: (7-bit, 8-bit, JsonGenerator): 84,3 / 76,2 / 198,5 msecs
Length 8, end escape: (7-bit, 8-bit, JsonGenerator): 73,8 / 70,9 / 231,6 msecs
Length 16, none escape: (7-bit, 8-bit, JsonGenerator): 64,0 / 56,3 / 120,0 msecs
Length 16, start escape: (7-bit, 8-bit, JsonGenerator): 73,5 / 62,4 / 129,5 msecs
Length 16, end escape: (7-bit, 8-bit, JsonGenerator): 67,7 / 59,4 / 172,5 msecs
Length 32, none escape: (7-bit, 8-bit, JsonGenerator): 63,0 / 52,9 / 85,8 msecs
Length 32, start escape: (7-bit, 8-bit, JsonGenerator): 67,0 / 56,4 / 103,6 msecs
Length 32, end escape: (7-bit, 8-bit, JsonGenerator): 63,3 / 54,3 / 146,5 msecs
Length 256, none escape: (7-bit, 8-bit, JsonGenerator): 60,1 / 52,1 / 60,7 msecs
Length 256, start escape: (7-bit, 8-bit, JsonGenerator): 60,8 / 54,7 / 83,5 msecs
Length 256, end escape: (7-bit, 8-bit, JsonGenerator): 61,9 / 51,7 / 138,7 msecs
Length 512, none escape: (7-bit, 8-bit, JsonGenerator): 59,5 / 50,9 / 56,5 msecs
Length 512, start escape: (7-bit, 8-bit, JsonGenerator): 61,6 / 53,3 / 79,3 msecs
Length 512, end escape: (7-bit, 8-bit, JsonGenerator): 60,8 / 50,1 / 132,9 msecs
Length 1024, none escape: (7-bit, 8-bit, JsonGenerator): 60,2 / 50,7 / 95,5 msecs
Length 1024, start escape: (7-bit, 8-bit, JsonGenerator): 60,3 / 52,5 / 78,7 msecs
Length 1024, end escape: (7-bit, 8-bit, JsonGenerator): 60,6 / 50,1 / 97,0 msecs
Length 8192, none escape: (7-bit, 8-bit, JsonGenerator): 58,0 / 49,0 / 32,4 msecs
Length 8192, start escape: (7-bit, 8-bit, JsonGenerator): 59,1 / 49,0 / 38,1 msecs
Length 8192, end escape: (7-bit, 8-bit, JsonGenerator): 58,9 / 48,9 / 34,6 msecs
before:
Length 8, none escape: (7-bit, 8-bit, JsonGenerator): 58,8 / 45,4 / 196,7 msecs
Length 8, start escape: (7-bit, 8-bit, JsonGenerator): 84,9 / 76,1 / 201,4 msecs
Length 8, end escape: (7-bit, 8-bit, JsonGenerator): 74,2 / 70,7 / 230,7 msecs
Length 16, none escape: (7-bit, 8-bit, JsonGenerator): 65,2 / 56,1 / 121,2 msecs
Length 16, start escape: (7-bit, 8-bit, JsonGenerator): 74,0 / 62,3 / 133,6 msecs
Length 16, end escape: (7-bit, 8-bit, JsonGenerator): 67,8 / 59,3 / 178,4 msecs
Length 32, none escape: (7-bit, 8-bit, JsonGenerator): 62,9 / 52,7 / 87,1 msecs
Length 32, start escape: (7-bit, 8-bit, JsonGenerator): 67,0 / 56,4 / 106,5 msecs
Length 32, end escape: (7-bit, 8-bit, JsonGenerator): 63,4 / 54,2 / 155,5 msecs
Length 256, none escape: (7-bit, 8-bit, JsonGenerator): 60,2 / 52,2 / 61,7 msecs
Length 256, start escape: (7-bit, 8-bit, JsonGenerator): 63,2 / 54,1 / 86,0 msecs
Length 256, end escape: (7-bit, 8-bit, JsonGenerator): 62,1 / 52,0 / 142,1 msecs
Length 512, none escape: (7-bit, 8-bit, JsonGenerator): 59,1 / 51,1 / 56,9 msecs
Length 512, start escape: (7-bit, 8-bit, JsonGenerator): 62,2 / 53,1 / 82,6 msecs
Length 512, end escape: (7-bit, 8-bit, JsonGenerator): 60,6 / 50,2 / 136,0 msecs
Length 1024, none escape: (7-bit, 8-bit, JsonGenerator): 59,9 / 50,7 / 96,4 msecs
Length 1024, start escape: (7-bit, 8-bit, JsonGenerator): 60,6 / 52,8 / 81,3 msecs
Length 1024, end escape: (7-bit, 8-bit, JsonGenerator): 60,5 / 49,7 / 97,0 msecs
Length 8192, none escape: (7-bit, 8-bit, JsonGenerator): 59,1 / 48,9 / 45,1 msecs
Length 8192, start escape: (7-bit, 8-bit, JsonGenerator): 59,7 / 49,1 / 49,5 msecs
Length 8192, end escape: (7-bit, 8-bit, JsonGenerator): 58,6 / 49,0 / 47,3 msecs
combined JsonGenerator results:
Length 8, none escape: (before, after): 196,7 / 196,7 msecs
Length 8, start escape: (before, after): 201,4 / 198,5 msecs
Length 8, end escape: (before, after): 230,7 / 231,6 msecs
Length 16, none escape: (before, after): 121,2 / 120,0 msecs
Length 16, start escape: (before, after): 133,6 / 129,5 msecs
Length 16, end escape: (before, after): 178,4 / 172,5 msecs
Length 32, none escape: (before, after): 87,1 / 85,8 msecs
Length 32, start escape: (before, after): 106,5 / 103,6 msecs
Length 32, end escape: (before, after): 155,5 / 146,5 msecs
Length 256, none escape: (before, after): 61,7 / 60,7 msecs
Length 256, start escape: (before, after): 86,0 / 83,5 msecs
Length 256, end escape: (before, after): 142,1 / 138,7 msecs
Length 512, none escape: (before, after): 56,9 / 56,5 msecs
Length 512, start escape: (before, after): 82,6 / 79,3 msecs
Length 512, end escape: (before, after): 136,0 / 132,9 msecs
Length 1024, none escape: (before, after): 96,4 / 95,5 msecs
Length 1024, start escape: (before, after): 81,3 / 78,7 msecs
Length 1024, end escape: (before, after): 97,0 / 97,0 msecs
Length 8192, none escape: (before, after): 45,1 / 32,4 msecs
Length 8192, start escape: (before, after): 49,5 / 38,1 msecs
Length 8192, end escape: (before, after): 47,3 / 34,6 msecs
It's not as big of a difference as I was seeing in my original benchmarks, although it's noticeable that larger strings appear to benefit quite a bit.
There is potentially another question of whether UTF8JsonGenerator#_extendOutputEscapesTo8Bits should overwrite JsonGeneratorImpl#sOutputEscapes, so as to avoid having to recreate the 8-bit-wide LUT for each individual JsonGenerator instance. I opted not to change this, since it crosses UTF8JsonGenerator's boundary into the parent class and would require demoting JsonGeneratorImpl#sOutputEscapes to a non-final field, which feels iffy.
@JoostK First of all: thank you for contributing this! At a high level this makes sense, but I do need to dig a bit deeper into this when I have time -- and right now I am a bit overloaded/overspread, so apologies for any delay there may be.
Having said that: one thing we will eventually need (if not already done; apologies if it has been) is to get a CLA, from here:
https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf
(needs to be sent just once for all Jackson contributions)
The usual way is to print, fill & sign, scan/photo, and email to "cla" at fasterxml dot com. Once I receive it we are good to go wrt merging (obv pending code review).
Thank you again; looking forward to getting this merged!
@JoostK Ok, I now understand the scope and added some suggestions. Since we are not using the new encoding table for all output, I think it's better to avoid overriding _outputEscapes.
But one thing I am wondering is whether this optimization came about from actual observations of usage -- that is, if this were merged, would it help with a use case you have? As you say, this method seems less commonly used, and if so, benefits might be limited. But if it addresses something that at least you use, it is more reasonable to merge it.
Thanks for the comments; I haven't gotten around to addressing them yet, nor to signing the CLA. That won't be a problem, but I don't typically have a printer at hand 😄.
My use case is for UTF-8 encoded bytes read from raw blobs, which are to be sent as a JSON-encoded string (the data is known to be UTF-8 encoded) in a web response; this can be on the order of ~100MB of data across ~30k strings, so writeUTF8String is ideal to avoid having to allocate and decode into a String.
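For context, the call pattern in question looks roughly like this (class and variable names are illustrative; the JsonGenerator calls are the regular jackson-core API):

```java
import com.fasterxml.jackson.core.JsonEncoding;
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;

import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

// Writes pre-encoded UTF-8 payloads as a JSON array of strings without first
// decoding each blob into a java.lang.String.
final class RawUtf8Writer {
    static void writeBlobs(OutputStream out, List<byte[]> utf8Blobs) throws IOException {
        JsonFactory factory = new JsonFactory();
        try (JsonGenerator gen = factory.createGenerator(out, JsonEncoding.UTF8)) {
            gen.writeStartArray();
            for (byte[] blob : utf8Blobs) {
                gen.writeUTF8String(blob, 0, blob.length);
            }
            gen.writeEndArray();
        }
    }
}
```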
Since the app will likely be deployed on x86_64, I intend to run the performance test on amd64 to gauge what the impact is there, as I suspect this may depend on the ISA (and possibly on how effective the branch predictor is, so it may differ across CPU vendors/generations). If the meaningful improvements only apply to arm64/M1, this may not be as beneficial as what I observed on macOS.
No worries @JoostK. FWIW, printing is optional; modifying the PDF with info & fake signature works perfectly fine too.
And thx for sharing use case: sounds legit.
Finally found some time to run this on Intel x64 (specifically a Core i7-1270P), but the results are all over the place, so I can't draw conclusions from it for now. Interestingly, the microbenchmark consistently shows that the 7-bit approach performs better on this CPU, which is the opposite of what I was measuring on arm64 (M1 Max). On the actual JsonGenerator test, however, I do see improvements, but those results are wildly unstable. I am a bit puzzled 🤷‍♂️
Not sure what this means for this PR, really. I'd really like to get stable results to make informed decisions on whether this is worth it. Maybe somebody else is able to run the test suite on amd64 CPUs?
@JoostK Ok, that is... interesting, given that the code seems like it should out-perform the existing implementation. I assume you tried with longer test run times but without seeing more stable results?