jackson-core
Improve performance of writing raw UTF-8 encoded byte arrays
The output escape table covers just 7 bits, meaning that a raw UTF-8 byte cannot be used to index into the table without a branch test for negative bytes (i.e. bytes with an unsigned value above 0x7F, which Java treats as negative). This extra check occurs in a tight loop and can be avoided if the lookup table were to cover all 8-bit indices.
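For illustration, here is a simplified sketch of the difference (hypothetical helper names, not the actual jackson-core loop):

```java
// Simplified sketch, not the actual jackson-core code: contrasts indexing a
// 128-entry escape table (needs a sign check, since Java bytes are signed)
// with a 256-entry table (every byte value maps directly).
class EscapeScanSketch {
    // 7-bit table: bytes >= 0x80 are negative, so they must be filtered out
    // before indexing -- an extra branch on every iteration of the hot loop.
    static boolean needsEscape7Bit(byte[] utf8, int[] escCodes128) {
        for (byte b : utf8) {
            if (b >= 0 && escCodes128[b] != 0) {
                return true;
            }
        }
        return false;
    }

    // 8-bit table: the unsigned value 0x00-0xFF is always a valid index,
    // so the sign check disappears from the loop body.
    static boolean needsEscape8Bit(byte[] utf8, int[] escCodes256) {
        for (byte b : utf8) {
            if (escCodes256[b & 0xFF] != 0) {
                return true;
            }
        }
        return false;
    }
}
```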
This commit introduces ad-hoc logic in UTF8JsonGenerator#writeUTF8String to create an extended copy of _outputEscapes if necessary, writing the copy back into the field to avoid having to compute it again within the same generator instance (unless it is changed). This ad-hoc strategy was chosen because it is the least disruptive to existing code; a larger-scale change around CharacterEscapes would impact the public API or otherwise introduce subtle chances for breakage.
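Conceptually the extension step amounts to something like the following sketch (hedged; the actual helper in the PR may differ in detail):

```java
import java.util.Arrays;

// Hedged sketch of extending a 128-entry escape table to 256 entries.
// The PR caches the result back into _outputEscapes; only the copy step is
// shown here. Entries 0x80-0xFF stay 0 because raw UTF-8 lead and
// continuation bytes are never escaped and can be copied through verbatim.
final class EscapeTableExtension {
    static int[] extendTo8Bits(int[] escapes) {
        if (escapes.length >= 256) {
            return escapes; // already wide enough, nothing to do
        }
        return Arrays.copyOf(escapes, 256);
    }
}
```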
Some quick-and-dirty JMH tests on an M1 Max (arm64 ⚠) with Azul Zulu JDK 21 show the following numbers:
| Benchmark | (length) | (needEscape) | (optimized) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|---|---|
| JmhTest.writeUtf8String | 32 | first | true | thrpt | 40 | 32,156 | ± 0,084 | ops/us |
| JmhTest.writeUtf8String | 32 | first | false | thrpt | 40 | 27,936 | ± 0,106 | ops/us |
| JmhTest.writeUtf8String | 32 | last | true | thrpt | 40 | 33,049 | ± 0,091 | ops/us |
| JmhTest.writeUtf8String | 32 | last | false | thrpt | 40 | 29,605 | ± 0,102 | ops/us |
| JmhTest.writeUtf8String | 32 | none | true | thrpt | 40 | 32,922 | ± 0,192 | ops/us |
| JmhTest.writeUtf8String | 32 | none | false | thrpt | 40 | 29,654 | ± 0,074 | ops/us |
| JmhTest.writeUtf8String | 256 | first | true | thrpt | 40 | 6,350 | ± 0,023 | ops/us |
| JmhTest.writeUtf8String | 256 | first | false | thrpt | 40 | 4,734 | ± 0,012 | ops/us |
| JmhTest.writeUtf8String | 256 | last | true | thrpt | 40 | 6,399 | ± 0,018 | ops/us |
| JmhTest.writeUtf8String | 256 | last | false | thrpt | 40 | 4,759 | ± 0,017 | ops/us |
| JmhTest.writeUtf8String | 256 | none | true | thrpt | 40 | 6,402 | ± 0,021 | ops/us |
| JmhTest.writeUtf8String | 256 | none | false | thrpt | 40 | 4,751 | ± 0,025 | ops/us |
| JmhTest.writeUtf8String | 512 | first | true | thrpt | 40 | 3,215 | ± 0,030 | ops/us |
| JmhTest.writeUtf8String | 512 | first | false | thrpt | 40 | 2,478 | ± 0,008 | ops/us |
| JmhTest.writeUtf8String | 512 | last | true | thrpt | 40 | 3,259 | ± 0,012 | ops/us |
| JmhTest.writeUtf8String | 512 | last | false | thrpt | 40 | 2,480 | ± 0,026 | ops/us |
| JmhTest.writeUtf8String | 512 | none | true | thrpt | 40 | 3,262 | ± 0,013 | ops/us |
| JmhTest.writeUtf8String | 512 | none | false | thrpt | 40 | 2,486 | ± 0,007 | ops/us |
The benchmark writes buffers of length (length) filled with 'a' in all positions; for (needEscape) 'first' the first byte is overwritten with '"', for 'last' the last byte, and for 'none' the buffer is left as a sequence in which no escapes need to be inserted.
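Roughly, the benchmark inputs would be built like this (an illustrative reconstruction, not the actual JMH setup):

```java
// Illustrative reconstruction of the benchmark input buffers.
// 'length' and 'needEscape' correspond to the JMH parameters above.
static byte[] buildInput(int length, String needEscape) {
    byte[] buf = new byte[length];
    java.util.Arrays.fill(buf, (byte) 'a');
    if ("first".equals(needEscape)) {
        buf[0] = (byte) '"';            // escape required at the very start
    } else if ("last".equals(needEscape)) {
        buf[length - 1] = (byte) '"';   // escape required at the very end
    }
    // "none": all 'a', nothing needs escaping
    return buf;
}
```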
Overall the numbers show improvements in the range of 11%–33%. I wonder whether this extends to other CPU architectures; I'm opening this PR to gauge interest in such a change. Note that this only affects UTF8JsonGenerator#writeUTF8String, which isn't typically used, as it's more common to process from char[] or String buffers. In my use case I already have a UTF-8 encoded byte[], which prompted me to look into this.
This logic can probably be vectorized quite nicely; that is also done in .NET's JSON writer infrastructure.
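For reference, a speculative sketch of what a vectorized scan could look like with the JDK's incubating Vector API (not part of this PR; class and method names are made up, and it needs --add-modules jdk.incubator.vector):

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Speculative sketch: scan raw UTF-8 for bytes that need escaping
// ('"', '\\', or control characters below 0x20) a whole vector at a time.
final class VectorizedEscapeScan {
    private static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

    static boolean needsEscape(byte[] utf8) {
        int i = 0;
        int bound = SPECIES.loopBound(utf8.length);
        for (; i < bound; i += SPECIES.length()) {
            ByteVector v = ByteVector.fromArray(SPECIES, utf8, i);
            VectorMask<Byte> m = v.compare(VectorOperators.EQ, (byte) '"')
                    .or(v.compare(VectorOperators.EQ, (byte) '\\'))
                    // unsigned compare so 0x80-0xFF (raw UTF-8) is not flagged
                    .or(v.compare(VectorOperators.UNSIGNED_LT, (byte) 0x20));
            if (m.anyTrue()) {
                return true;
            }
        }
        for (; i < utf8.length; i++) { // scalar tail
            int b = utf8[i] & 0xFF;
            if (b == '"' || b == '\\' || b < 0x20) {
                return true;
            }
        }
        return false;
    }
}
```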
Thanks @JoostK. Would you be able to add the benchmark to https://github.com/FasterXML/jackson-core/tree/2.19/src/test/java/perf ?
Sure, I can add some; while looking at the existing ones I wonder what the desired testing strategy is:
1. extract both write loops to be able to compare the prior state (7-bit LUT) against the new state (8-bit LUT), or
2. call into `JsonGenerator#writeUTF8String` and then run the test with and without the change applied, possibly adding `char[]` writing as a comparative benchmark.

What is the most valuable thing to have here? Option 1 is meaningful for comparing this particular change across machines/JVMs, but option 2 is more valuable for measuring and comparing JsonGenerator write performance going forward.
Both sound useful - could you add both benchmarks?
I'll come up with something, probably over the coming days.
Accidentally rebased onto master, unaware that this PR was targeting 2.19. Reverted back to 2.19.
Here are the results on my MBP w/ M1 Max:
after:
Length 8, none escape: (7-bit, 8-bit, JsonGenerator): 57,4 / 47,8 / 196,7 msecs
Length 8, start escape: (7-bit, 8-bit, JsonGenerator): 84,3 / 76,2 / 198,5 msecs
Length 8, end escape: (7-bit, 8-bit, JsonGenerator): 73,8 / 70,9 / 231,6 msecs
Length 16, none escape: (7-bit, 8-bit, JsonGenerator): 64,0 / 56,3 / 120,0 msecs
Length 16, start escape: (7-bit, 8-bit, JsonGenerator): 73,5 / 62,4 / 129,5 msecs
Length 16, end escape: (7-bit, 8-bit, JsonGenerator): 67,7 / 59,4 / 172,5 msecs
Length 32, none escape: (7-bit, 8-bit, JsonGenerator): 63,0 / 52,9 / 85,8 msecs
Length 32, start escape: (7-bit, 8-bit, JsonGenerator): 67,0 / 56,4 / 103,6 msecs
Length 32, end escape: (7-bit, 8-bit, JsonGenerator): 63,3 / 54,3 / 146,5 msecs
Length 256, none escape: (7-bit, 8-bit, JsonGenerator): 60,1 / 52,1 / 60,7 msecs
Length 256, start escape: (7-bit, 8-bit, JsonGenerator): 60,8 / 54,7 / 83,5 msecs
Length 256, end escape: (7-bit, 8-bit, JsonGenerator): 61,9 / 51,7 / 138,7 msecs
Length 512, none escape: (7-bit, 8-bit, JsonGenerator): 59,5 / 50,9 / 56,5 msecs
Length 512, start escape: (7-bit, 8-bit, JsonGenerator): 61,6 / 53,3 / 79,3 msecs
Length 512, end escape: (7-bit, 8-bit, JsonGenerator): 60,8 / 50,1 / 132,9 msecs
Length 1024, none escape: (7-bit, 8-bit, JsonGenerator): 60,2 / 50,7 / 95,5 msecs
Length 1024, start escape: (7-bit, 8-bit, JsonGenerator): 60,3 / 52,5 / 78,7 msecs
Length 1024, end escape: (7-bit, 8-bit, JsonGenerator): 60,6 / 50,1 / 97,0 msecs
Length 8192, none escape: (7-bit, 8-bit, JsonGenerator): 58,0 / 49,0 / 32,4 msecs
Length 8192, start escape: (7-bit, 8-bit, JsonGenerator): 59,1 / 49,0 / 38,1 msecs
Length 8192, end escape: (7-bit, 8-bit, JsonGenerator): 58,9 / 48,9 / 34,6 msecs
before:
Length 8, none escape: (7-bit, 8-bit, JsonGenerator): 58,8 / 45,4 / 196,7 msecs
Length 8, start escape: (7-bit, 8-bit, JsonGenerator): 84,9 / 76,1 / 201,4 msecs
Length 8, end escape: (7-bit, 8-bit, JsonGenerator): 74,2 / 70,7 / 230,7 msecs
Length 16, none escape: (7-bit, 8-bit, JsonGenerator): 65,2 / 56,1 / 121,2 msecs
Length 16, start escape: (7-bit, 8-bit, JsonGenerator): 74,0 / 62,3 / 133,6 msecs
Length 16, end escape: (7-bit, 8-bit, JsonGenerator): 67,8 / 59,3 / 178,4 msecs
Length 32, none escape: (7-bit, 8-bit, JsonGenerator): 62,9 / 52,7 / 87,1 msecs
Length 32, start escape: (7-bit, 8-bit, JsonGenerator): 67,0 / 56,4 / 106,5 msecs
Length 32, end escape: (7-bit, 8-bit, JsonGenerator): 63,4 / 54,2 / 155,5 msecs
Length 256, none escape: (7-bit, 8-bit, JsonGenerator): 60,2 / 52,2 / 61,7 msecs
Length 256, start escape: (7-bit, 8-bit, JsonGenerator): 63,2 / 54,1 / 86,0 msecs
Length 256, end escape: (7-bit, 8-bit, JsonGenerator): 62,1 / 52,0 / 142,1 msecs
Length 512, none escape: (7-bit, 8-bit, JsonGenerator): 59,1 / 51,1 / 56,9 msecs
Length 512, start escape: (7-bit, 8-bit, JsonGenerator): 62,2 / 53,1 / 82,6 msecs
Length 512, end escape: (7-bit, 8-bit, JsonGenerator): 60,6 / 50,2 / 136,0 msecs
Length 1024, none escape: (7-bit, 8-bit, JsonGenerator): 59,9 / 50,7 / 96,4 msecs
Length 1024, start escape: (7-bit, 8-bit, JsonGenerator): 60,6 / 52,8 / 81,3 msecs
Length 1024, end escape: (7-bit, 8-bit, JsonGenerator): 60,5 / 49,7 / 97,0 msecs
Length 8192, none escape: (7-bit, 8-bit, JsonGenerator): 59,1 / 48,9 / 45,1 msecs
Length 8192, start escape: (7-bit, 8-bit, JsonGenerator): 59,7 / 49,1 / 49,5 msecs
Length 8192, end escape: (7-bit, 8-bit, JsonGenerator): 58,6 / 49,0 / 47,3 msecs
combined JsonGenerator results:
Length 8, none escape: (before, after): 196,7 / 196,7 msecs
Length 8, start escape: (before, after): 201,4 / 198,5 msecs
Length 8, end escape: (before, after): 230,7 / 231,6 msecs
Length 16, none escape: (before, after): 121,2 / 120,0 msecs
Length 16, start escape: (before, after): 133,6 / 129,5 msecs
Length 16, end escape: (before, after): 178,4 / 172,5 msecs
Length 32, none escape: (before, after): 87,1 / 85,8 msecs
Length 32, start escape: (before, after): 106,5 / 103,6 msecs
Length 32, end escape: (before, after): 155,5 / 146,5 msecs
Length 256, none escape: (before, after): 61,7 / 60,7 msecs
Length 256, start escape: (before, after): 86,0 / 83,5 msecs
Length 256, end escape: (before, after): 142,1 / 138,7 msecs
Length 512, none escape: (before, after): 56,9 / 56,5 msecs
Length 512, start escape: (before, after): 82,6 / 79,3 msecs
Length 512, end escape: (before, after): 136,0 / 132,9 msecs
Length 1024, none escape: (before, after): 96,4 / 95,5 msecs
Length 1024, start escape: (before, after): 81,3 / 78,7 msecs
Length 1024, end escape: (before, after): 97,0 / 97,0 msecs
Length 8192, none escape: (before, after): 45,1 / 32,4 msecs
Length 8192, start escape: (before, after): 49,5 / 38,1 msecs
Length 8192, end escape: (before, after): 47,3 / 34,6 msecs
It's not as big of a difference as I was seeing in my original benchmarks, although it's noticeable that larger strings appear to benefit quite a bit.
There is potentially another question of whether UTF8JsonGenerator#_extendOutputEscapesTo8Bits should overwrite JsonGeneratorImpl#sOutputEscapes, so as to avoid having to recreate the 8-bit-wide LUT for each individual JsonGenerator instance. I opted not to change this, since it crosses UTF8JsonGenerator's boundary into the parent class and would require demoting JsonGeneratorImpl#sOutputEscapes to a non-final field, which feels iffy.
@JoostK First of all: thank you for contributing this! At a high level this makes sense, but I do need to dig a bit deeper into this when I have time -- and right now I am a bit overloaded/overspread, so apologies for any delay there may be.
Having said that: one thing we will eventually need (if not already done; apologies if it has been) is to get a CLA, from here:
https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf
(needs to be sent just once for all Jackson contributions)
The usual way is to print, fill & sign, scan/photo, and email to "cla" at fasterxml dot com. Once I receive it we are good to go wrt merging (obv pending code review).
Thank you again; looking forward to getting this merged!
@JoostK Ok, I now understand the scope and added some suggestions. Since we are not using the new encoding table for all output, I think it's better to avoid overriding _outputEscapes.
But one thing I am wondering is whether this optimization came about from actual observations of usage -- that is, if this were merged, would it help with a use case you have? As you say, this method seems less commonly used, and if so, benefits might be limited. But if it addresses something that at least you use, it is more reasonable to merge it.
Thanks for the comments; I haven't gotten around to addressing them yet, nor to signing the CLA. That won't be a problem, but I don't typically have a printer at hand 😄.
My use case is for UTF-8 encoded bytes read from raw blobs, which are to be sent as a JSON-encoded string (the data is known to be UTF-8 encoded) in a web response; this can be on the order of ~100MB of data across ~30k strings, so writeUTF8String is ideal to avoid having to allocate and decode into a String.
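For context, the call pattern in question looks roughly like this (class and variable names are illustrative; the JsonGenerator calls are the regular jackson-core API):

```java
import com.fasterxml.jackson.core.JsonEncoding;
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;

import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

// Writes pre-encoded UTF-8 payloads as a JSON array of strings without first
// decoding each blob into a java.lang.String.
final class RawUtf8Writer {
    static void writeBlobs(OutputStream out, List<byte[]> utf8Blobs) throws IOException {
        JsonFactory factory = new JsonFactory();
        try (JsonGenerator gen = factory.createGenerator(out, JsonEncoding.UTF8)) {
            gen.writeStartArray();
            for (byte[] blob : utf8Blobs) {
                gen.writeUTF8String(blob, 0, blob.length);
            }
            gen.writeEndArray();
        }
    }
}
```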
Since the app will likely be deployed on x86_64, I intend to run the performance test on amd64 to gauge what the impact is there, as I suspect this may depend on the ISA (and possibly on how effective the branch predictor is, so it may differ across CPU vendors/generations). If the meaningful improvements only apply to arm64/M1, this may not be as beneficial as what I observed on macOS.
No worries @JoostK. FWIW, printing is optional; modifying the PDF with info & fake signature works perfectly fine too.
And thx for sharing use case: sounds legit.
Finally found some time to run this on Intel x64 (specifically a Core i7-1270P), but the results are all over the place, so I can't draw conclusions from it for now. Interestingly, the microbenchmark consistently shows that the 7-bit approach performs better on this CPU, which is the opposite of what I was measuring on arm64 (M1 Max). On the actual JsonGenerator test, however, I do see improvements, but those results are wildly unstable. I am a bit puzzled 🤷‍♂️
Not sure what this means for this PR, really. I'd really like to get stable results to make informed decisions on whether this is worth it. Maybe somebody else is able to run the test suite on amd64 CPUs?
@JoostK Ok, that is... interesting, given that the code seems like it should out-perform the existing implementation. I assume you tried with longer test run times but without seeing more stable results?