jackson-module-scala
Performance regression with 2.17.x
Both 2.17.0 and 2.17.0-rc1 are affected when small messages (less than 100 bytes) are parsed or serialized in multiple threads using the same object mapper instance.
The problem exists when parsing bigger messages but the performance difference is smaller.
All libraries that use the latest version of the jackson library under the hood are affected: json4s, play-json, weePickle, etc.
Here are the results of benchmarks that parse and serialize 62-byte and 1185-byte messages (ADTReading/ADTWriting and ArrayOfEnumADTsReading/ArrayOfEnumADTsWriting respectively) on Intel(R) Core(TM) i7-11700 @ 2.50GHz (max 4.90 GHz) with DDR4-3200, using 16 threads:
2.16.2
[info] Benchmark (size) Mode Cnt Score Error Units
[info] ADTReading.borer N/A thrpt 10 6794141.947 ± 28943.355 ops/s
[info] ADTReading.circe N/A thrpt 10 5733802.213 ± 20142.997 ops/s
[info] ADTReading.circeJsoniter N/A thrpt 10 6033807.062 ± 39206.180 ops/s
[info] ADTReading.jacksonScala N/A thrpt 10 8600750.285 ± 29585.508 ops/s
[info] ADTReading.json4sJackson N/A thrpt 10 1986617.545 ± 11937.239 ops/s
[info] ADTReading.json4sNative N/A thrpt 10 728301.651 ± 3179.717 ops/s
[info] ADTReading.jsoniterScala N/A thrpt 10 31056253.119 ± 162615.821 ops/s
[info] ADTReading.playJson N/A thrpt 10 2854140.465 ± 12044.125 ops/s
[info] ADTReading.playJsonJsoniter N/A thrpt 10 5930774.513 ± 12228.253 ops/s
[info] ADTReading.smithy4sJson N/A thrpt 10 8047920.054 ± 57498.071 ops/s
[info] ADTReading.sprayJson N/A thrpt 10 7559747.663 ± 37500.787 ops/s
[info] ADTReading.uPickle N/A thrpt 10 10820366.906 ± 38588.185 ops/s
[info] ADTReading.weePickle N/A thrpt 10 8520949.161 ± 38999.046 ops/s
[info] ADTReading.zioJson N/A thrpt 10 6516965.272 ± 30056.348 ops/s
[info] ADTWriting.borer N/A thrpt 10 12725701.358 ± 64383.508 ops/s
[info] ADTWriting.circe N/A thrpt 10 5702372.141 ± 123333.038 ops/s
[info] ADTWriting.circeJsoniter N/A thrpt 10 6136183.702 ± 31911.935 ops/s
[info] ADTWriting.jacksonScala N/A thrpt 10 20213415.218 ± 154415.084 ops/s
[info] ADTWriting.json4sJackson N/A thrpt 10 1970762.580 ± 5888.709 ops/s
[info] ADTWriting.json4sNative N/A thrpt 10 2129475.567 ± 10664.320 ops/s
[info] ADTWriting.jsoniterScala N/A thrpt 10 86348750.403 ± 3923732.627 ops/s
[info] ADTWriting.jsoniterScalaPrealloc N/A thrpt 10 126135815.779 ± 753875.520 ops/s
[info] ADTWriting.playJson N/A thrpt 10 7066647.975 ± 38110.136 ops/s
[info] ADTWriting.playJsonJsoniter N/A thrpt 10 10034627.023 ± 45579.652 ops/s
[info] ADTWriting.smithy4sJson N/A thrpt 10 19544469.284 ± 206715.422 ops/s
[info] ADTWriting.sprayJson N/A thrpt 10 8817363.222 ± 29355.150 ops/s
[info] ADTWriting.uPickle N/A thrpt 10 12346407.504 ± 41903.003 ops/s
[info] ADTWriting.weePickle N/A thrpt 10 15216077.687 ± 71823.446 ops/s
[info] ADTWriting.zioJson N/A thrpt 10 6696186.925 ± 37137.477 ops/s
[info] ArrayOfEnumADTsReading.borer 128 thrpt 10 1485188.033 ± 11824.058 ops/s
[info] ArrayOfEnumADTsReading.circe 128 thrpt 10 1060544.252 ± 5714.295 ops/s
[info] ArrayOfEnumADTsReading.circeJsoniter 128 thrpt 10 1078942.307 ± 2906.939 ops/s
[info] ArrayOfEnumADTsReading.jacksonScala 128 thrpt 10 1399993.550 ± 5345.969 ops/s
[info] ArrayOfEnumADTsReading.json4sJackson 128 thrpt 10 386204.100 ± 1860.096 ops/s
[info] ArrayOfEnumADTsReading.json4sNative 128 thrpt 10 312858.047 ± 3125.767 ops/s
[info] ArrayOfEnumADTsReading.jsoniterScala 128 thrpt 10 4289813.648 ± 15066.858 ops/s
[info] ArrayOfEnumADTsReading.playJson 128 thrpt 10 478973.090 ± 2376.637 ops/s
[info] ArrayOfEnumADTsReading.playJsonJsoniter 128 thrpt 10 515684.604 ± 1902.051 ops/s
[info] ArrayOfEnumADTsReading.sprayJson 128 thrpt 10 755587.471 ± 4383.413 ops/s
[info] ArrayOfEnumADTsReading.uPickle 128 thrpt 10 1242681.165 ± 2746.319 ops/s
[info] ArrayOfEnumADTsReading.weePickle 128 thrpt 10 1169063.048 ± 4494.799 ops/s
[info] ArrayOfEnumADTsReading.zioJson 128 thrpt 10 563521.111 ± 1445.949 ops/s
[info] ArrayOfEnumADTsWriting.borer 128 thrpt 10 2054275.748 ± 9991.515 ops/s
[info] ArrayOfEnumADTsWriting.circe 128 thrpt 10 1241981.451 ± 112432.709 ops/s
[info] ArrayOfEnumADTsWriting.circeJsoniter 128 thrpt 10 2522632.787 ± 16045.668 ops/s
[info] ArrayOfEnumADTsWriting.jacksonScala 128 thrpt 10 3174526.420 ± 17247.360 ops/s
[info] ArrayOfEnumADTsWriting.json4sJackson 128 thrpt 10 1069988.662 ± 5451.828 ops/s
[info] ArrayOfEnumADTsWriting.json4sNative 128 thrpt 10 387662.462 ± 1569.061 ops/s
[info] ArrayOfEnumADTsWriting.jsoniterScala 128 thrpt 10 13584173.183 ± 65176.029 ops/s
[info] ArrayOfEnumADTsWriting.jsoniterScalaPrealloc 128 thrpt 10 15338486.834 ± 73143.637 ops/s
[info] ArrayOfEnumADTsWriting.playJson 128 thrpt 10 2497807.549 ± 15178.768 ops/s
[info] ArrayOfEnumADTsWriting.playJsonJsoniter 128 thrpt 10 3063237.303 ± 25286.111 ops/s
[info] ArrayOfEnumADTsWriting.sprayJson 128 thrpt 10 1816238.727 ± 8867.954 ops/s
[info] ArrayOfEnumADTsWriting.uPickle 128 thrpt 10 1595086.586 ± 8474.119 ops/s
[info] ArrayOfEnumADTsWriting.weePickle 128 thrpt 10 2396877.337 ± 8272.100 ops/s
[info] ArrayOfEnumADTsWriting.zioJson 128 thrpt 10 2754854.478 ± 10817.622 ops/s
2.17.x
[info] Benchmark (size) Mode Cnt Score Error Units
[info] ADTReading.borer N/A thrpt 10 6808060.010 ± 24121.820 ops/s
[info] ADTReading.circe N/A thrpt 10 5504991.218 ± 28924.908 ops/s
[info] ADTReading.circeJsoniter N/A thrpt 10 5965473.392 ± 24541.654 ops/s
[info] ADTReading.jacksonScala N/A thrpt 10 3880829.241 ± 14779.470 ops/s
[info] ADTReading.json4sJackson N/A thrpt 10 1684115.890 ± 17922.496 ops/s
[info] ADTReading.json4sNative N/A thrpt 10 781783.122 ± 9209.704 ops/s
[info] ADTReading.jsoniterScala N/A thrpt 10 28193676.278 ± 170866.780 ops/s
[info] ADTReading.playJson N/A thrpt 10 1861673.096 ± 32396.971 ops/s
[info] ADTReading.playJsonJsoniter N/A thrpt 10 5936651.174 ± 28715.564 ops/s
[info] ADTReading.smithy4sJson N/A thrpt 10 8183263.456 ± 78205.821 ops/s
[info] ADTReading.sprayJson N/A thrpt 10 7725596.379 ± 25148.636 ops/s
[info] ADTReading.uPickle N/A thrpt 10 10532982.352 ± 34026.274 ops/s
[info] ADTReading.weePickle N/A thrpt 10 3662160.317 ± 10805.963 ops/s
[info] ADTReading.zioJson N/A thrpt 10 6552279.264 ± 26565.487 ops/s
[info] ADTWriting.borer N/A thrpt 10 14804007.272 ± 66986.481 ops/s
[info] ADTWriting.circe N/A thrpt 10 5555829.627 ± 33717.558 ops/s
[info] ADTWriting.circeJsoniter N/A thrpt 10 6289535.722 ± 24806.934 ops/s
[info] ADTWriting.jacksonScala N/A thrpt 10 3815278.282 ± 44918.845 ops/s
[info] ADTWriting.json4sJackson N/A thrpt 10 6895.294 ± 4815.743 ops/s
[info] ADTWriting.json4sNative N/A thrpt 10 2123757.624 ± 10590.700 ops/s
[info] ADTWriting.jsoniterScala N/A thrpt 10 82359727.429 ± 3453467.534 ops/s
[info] ADTWriting.jsoniterScalaPrealloc N/A thrpt 10 134212123.633 ± 1162472.156 ops/s
[info] ADTWriting.playJson N/A thrpt 10 586570.259 ± 1309910.167 ops/s
[info] ADTWriting.playJsonJsoniter N/A thrpt 10 9828872.088 ± 63874.635 ops/s
[info] ADTWriting.smithy4sJson N/A thrpt 10 28487361.435 ± 130873.054 ops/s
[info] ADTWriting.sprayJson N/A thrpt 10 12135052.269 ± 49628.928 ops/s
[info] ADTWriting.uPickle N/A thrpt 10 10269863.568 ± 68308.485 ops/s
[info] ADTWriting.weePickle N/A thrpt 10 3263834.084 ± 8549.350 ops/s
[info] ADTWriting.zioJson N/A thrpt 10 6440516.792 ± 20713.803 ops/s
[info] ArrayOfEnumADTsReading.borer 128 thrpt 10 1267431.783 ± 6201.075 ops/s
[info] ArrayOfEnumADTsReading.circe 128 thrpt 10 1061247.664 ± 7164.785 ops/s
[info] ArrayOfEnumADTsReading.circeJsoniter 128 thrpt 10 1079132.139 ± 5715.793 ops/s
[info] ArrayOfEnumADTsReading.jacksonScala 128 thrpt 10 1396218.772 ± 6514.442 ops/s
[info] ArrayOfEnumADTsReading.json4sJackson 128 thrpt 10 384574.257 ± 2804.130 ops/s
[info] ArrayOfEnumADTsReading.json4sNative 128 thrpt 10 314105.580 ± 2377.774 ops/s
[info] ArrayOfEnumADTsReading.jsoniterScala 128 thrpt 10 4304162.519 ± 14939.710 ops/s
[info] ArrayOfEnumADTsReading.playJson 128 thrpt 10 394406.502 ± 1457.237 ops/s
[info] ArrayOfEnumADTsReading.playJsonJsoniter 128 thrpt 10 667298.395 ± 2485.742 ops/s
[info] ArrayOfEnumADTsReading.sprayJson 128 thrpt 10 764851.351 ± 4757.110 ops/s
[info] ArrayOfEnumADTsReading.uPickle 128 thrpt 10 1249665.477 ± 7025.275 ops/s
[info] ArrayOfEnumADTsReading.weePickle 128 thrpt 10 1127465.787 ± 7466.347 ops/s
[info] ArrayOfEnumADTsReading.zioJson 128 thrpt 10 563303.187 ± 2571.587 ops/s
[info] ArrayOfEnumADTsWriting.borer 128 thrpt 10 2058803.540 ± 12165.516 ops/s
[info] ArrayOfEnumADTsWriting.circe 128 thrpt 10 2011009.990 ± 12085.455 ops/s
[info] ArrayOfEnumADTsWriting.circeJsoniter 128 thrpt 10 2667427.734 ± 13161.756 ops/s
[info] ArrayOfEnumADTsWriting.jacksonScala 128 thrpt 10 2969747.276 ± 13888.207 ops/s
[info] ArrayOfEnumADTsWriting.json4sJackson 128 thrpt 10 970464.315 ± 6924.823 ops/s
[info] ArrayOfEnumADTsWriting.json4sNative 128 thrpt 10 388122.507 ± 1894.329 ops/s
[info] ArrayOfEnumADTsWriting.jsoniterScala 128 thrpt 10 13694337.067 ± 54600.403 ops/s
[info] ArrayOfEnumADTsWriting.jsoniterScalaPrealloc 128 thrpt 10 15482028.886 ± 63118.277 ops/s
[info] ArrayOfEnumADTsWriting.playJson 128 thrpt 10 2181855.381 ± 11481.359 ops/s
[info] ArrayOfEnumADTsWriting.playJsonJsoniter 128 thrpt 10 3063457.946 ± 22866.313 ops/s
[info] ArrayOfEnumADTsWriting.sprayJson 128 thrpt 10 1822167.265 ± 11824.459 ops/s
[info] ArrayOfEnumADTsWriting.uPickle 128 thrpt 10 2018705.050 ± 10173.067 ops/s
[info] ArrayOfEnumADTsWriting.weePickle 128 thrpt 10 2164800.857 ± 22721.277 ops/s
[info] ArrayOfEnumADTsWriting.zioJson 128 thrpt 10 2720774.650 ± 12951.514 ops/s
Steps to reproduce:
- Download binaries of async-profiler and unpack them into /opt/async-profiler
- git clone https://github.com/plokhotnyuk/jsoniter-scala
- cd jsoniter-scala
- sbt -java-home /usr/lib/jvm/jdk-21 jsoniter-scala-benchmarkJVM/clean 'jsoniter-scala-benchmarkJVM/jmh:run -prof "async:dir=target/async-reports;interval=1000000;output=flamegraph;libPath=/opt/async-profiler/lib/libasyncProfiler.so" -jvmArgsAppend "-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints" --p size=128 -t 16 -wi 10 -i 10 ADT'
The issue persists when tested on other JVMs (GraalVM Community, Oracle GraalVM) and with other garbage collectors (G1GC, ZGC).
Here are flame-graphs of async-profiler reports for ADTWriting.jacksonScala benchmark:
2.16.2
2.17.0
For the json4s and play-json libraries, performance can dynamically slow down by about 1000x:
[info] # Warmup Iteration 1: 498097.603 ops/s
[info] # Warmup Iteration 2: 1396602.529 ops/s
[info] # Warmup Iteration 3: 1390047.351 ops/s
[info] # Warmup Iteration 4: 1397460.036 ops/s
[info] # Warmup Iteration 5: 548693.538 ops/s
[info] # Warmup Iteration 6: 467155.766 ops/s
[info] # Warmup Iteration 7: 308437.551 ops/s
[info] # Warmup Iteration 8: 299814.921 ops/s
[info] # Warmup Iteration 9: 120140.861 ops/s
[info] # Warmup Iteration 10: 16015.216 ops/s
[info] Iteration 1: 6210.625 ops/s
[info] Iteration 2: 3199.072 ops/s
[info] Iteration 3: 9214.890 ops/s
[info] Iteration 4: 6565.308 ops/s
[info] Iteration 5: 1545.670 ops/s
[info] Iteration 6: 9597.542 ops/s
[info] Iteration 7: 6781.583 ops/s
[info] Iteration 8: 12569.237 ops/s
[info] Iteration 9: 5324.412 ops/s
[info] Iteration 10: 7944.598 ops/s
Could using ThreadLocal for caching of object mappers be an acceptable workaround for this issue?
Is there any workaround where we can explicitly define the initial size of the underlying buffers that would be suitable for use with other libraries like play-json and weePickle?
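For illustration, here is a minimal sketch of the ThreadLocal-based ObjectMapper caching idea asked about above; the MapperCache object and its layout are hypothetical, only the Jackson and jackson-module-scala classes are real:

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.databind.json.JsonMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Hypothetical per-thread cache: each thread keeps reusing its own mapper,
// and therefore the same JsonFactory and the same buffer-recycler pool.
object MapperCache {
  private val cached: ThreadLocal[ObjectMapper] =
    ThreadLocal.withInitial[ObjectMapper](() => JsonMapper.builder().addModule(DefaultScalaModule).build())

  def mapper: ObjectMapper = cached.get()
}
```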
Thanks @plokhotnyuk
The switch away from ThreadLocal caching is the biggest change in 2.17. It can be switched back on via config.
FYI @cowtowncoder
It can be switched back on via config.
@pjfanning Is it the .recyclerPool(JsonRecyclerPools.threadLocalPool()) call for the factory builder?
Interesting. My tests with jackson-benchmarks didn't show a regression for rc1 (I think I did run tests) -- possibly since they do not collect output as byte[] (writeValueAsBytes()). This was with JDK 8 though, in case that matters.
It can be switched back on via config.
@pjfanning Is it the .recyclerPool(JsonRecyclerPools.threadLocalPool()) call for the factory builder?
@plokhotnyuk I'm on a day trip so only have my phone. That looks right though.
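For readers following along, a minimal sketch of that call on the factory builder, combined with a single shared mapper; the wiring below is an assumption for illustration, not the exact benchmark code:

```scala
import com.fasterxml.jackson.core.{JsonFactory, JsonFactoryBuilder}
import com.fasterxml.jackson.core.util.JsonRecyclerPools
import com.fasterxml.jackson.databind.json.JsonMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Factory configured with the pre-2.17 ThreadLocal-based buffer-recycler pool.
val factory: JsonFactory = new JsonFactoryBuilder()
  .recyclerPool(JsonRecyclerPools.threadLocalPool())
  .build()

// One mapper instance built on that factory and shared across threads.
val mapper = JsonMapper.builder(factory)
  .addModule(DefaultScalaModule)
  .build()
```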
@cowtowncoder would it be fair to say that users should try to create as few ObjectMappers as possible and reuse them as much as possible? With the new default recycler, having a single ObjectMapper shared across threads might be the best option for high throughput.
Yes, ObjectMappers should always be reused, regardless of other aspects: they (or rather, (de)serializer construction) are very heavy objects, and the overhead of throwing them away negates any benefits of other caching and recycling.
And for safety, reusing ObjectReaders and ObjectWriters (constructed via ObjectMapper) is recommended: their creation is very cheap as well (instances created by the same ObjectMapper still share (de)serializer caching and reuse, in a thread-safe manner).
Having said that, it sounds like here reuse is done as expected, so wasteful non-reuse would not be an issue.
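A minimal sketch of the reuse pattern described above, assuming a hypothetical Reading payload type (the Jackson APIs shown -- readerFor, writerFor, writeValueAsBytes -- are real):

```scala
import com.fasterxml.jackson.databind.{ObjectMapper, ObjectReader, ObjectWriter}
import com.fasterxml.jackson.databind.json.JsonMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Hypothetical payload type used only for illustration.
case class Reading(id: Long, value: Double)

object ReadingCodec {
  // Built once and shared: (de)serializer construction is the expensive part.
  val mapper: ObjectMapper = JsonMapper.builder().addModule(DefaultScalaModule).build()

  // ObjectReader/ObjectWriter are immutable and cheap to create, and they share
  // the mapper's (de)serializer caches, so they can also be built once and reused.
  val reader: ObjectReader = mapper.readerFor(classOf[Reading])
  val writer: ObjectWriter = mapper.writerFor(classOf[Reading])

  def decode(bytes: Array[Byte]): Reading = reader.readValue[Reading](bytes)
  def encode(r: Reading): Array[Byte] = writer.writeValueAsBytes(r)
}
```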
FWIW, the json4s and play-json case is the most worrying: it almost suggests that buffer recycling might be growing without bound. As if something was able to "return" buffers to recyclers, but not to actually recycle such buffer instances. Otherwise lack of recycling in itself should not cause increasingly bad performance.
And one more thing: I will be off for about a week so may not be able to follow up quickly here -- but I am very interested in learning what exactly might be going on to cause this problem, and how to resolve it.
FWIW, the json4s and play-json case is the most worrying: it almost suggests that buffer recycling might be growing without bound. As if something was able to "return" buffers to recyclers, but not to actually recycle such buffer instances. Otherwise lack of recycling in itself should not cause increasingly bad performance.
I haven't looked at the play-json/json4s code but one possibility is that they create ObjectMappers and use them once and throw them away. They possibly don't use a JsonFactory instance to create the ObjectMappers. So every ObjectMapper could have its own JsonFactory and every JsonFactory has its own BufferRecycler instance - and this ends up hardly seeing any reuse of arrays.
With the old ThreadLocal buffer-recycler the cache is not specific to a JsonFactory instance - so you can get away with not trying to reuse the specific buffer-recycler instance.
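To make that hypothesis concrete, here is a hedged sketch of the usage pattern that would defeat the 2.17 default pool, contrasted with a shared mapper; the parse helpers are hypothetical, and whether play-json/json4s actually do this is exactly what is being speculated about above:

```scala
import com.fasterxml.jackson.databind.json.JsonMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Anti-pattern: a fresh mapper per message means a fresh JsonFactory and, under
// the 2.17 default, a fresh recycler pool -- buffers released after the call
// belong to a pool that is never consulted again.
def parseThrowaway[T](json: String, clazz: Class[T]): T =
  JsonMapper.builder().addModule(DefaultScalaModule).build().readValue(json, clazz)

// Preferred: one shared mapper, so one pool keeps recycling its buffers.
val sharedMapper = JsonMapper.builder().addModule(DefaultScalaModule).build()
def parseShared[T](json: String, clazz: Class[T]): T = sharedMapper.readValue(json, clazz)
```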
Yes, but that wouldn't explain gradually worsening behavior… it'd just be the overall no-recycling case. I guess it'd be possible to try with a no-op recycler and compare numbers.
I fear that a lot of Jackson users just create lots of ObjectMappers and are unaware of JsonFactory. These users do not get optimal performance because they do not reuse ObjectMappers. In Jackson 2.16, they got ThreadLocal buffer recycling that spanned all ObjectMappers. In Jackson 2.17, they will be getting new BufferRecycler instances for every ObjectMapper instance - and this will not be much better than switching off the BufferRecycler - especially with ObjectMappers that are used to parse or write just 1 small file.
@cowtowncoder would it be possible to consider changing the default JsonRecyclerPools to be the sharedLockFreePool implementation or possibly even to revert back to the threadLocal implementation? I think there will be too many users who never thought about reusing JsonFactory instances and/or ObjectMapper instances.
https://github.com/FasterXML/jackson-core/blob/5e109b2a52ec85b737f16071e31d925af8b07dbd/src/main/java/com/fasterxml/jackson/core/util/JsonRecyclerPools.java#L20-L30
The strange thing happening with json4s and play-json: after 5-10 seconds of working under load, the allocation rate drops together with throughput, while GC time reaches 1.5 seconds for a 1-second iteration:
[info] # Benchmark: com.github.plokhotnyuk.jsoniter_scala.benchmark.ADTWriting.playJson
[info] # Run progress: 1.45% complete, ETA 04:17:51
[info] # Fork: 1 of 1
[info] # Warmup Iteration 1: 3013960.320 ops/s
[info] # Warmup Iteration 2: 789488.790 ops/s
[info] # Warmup Iteration 3: 139809.193 ops/s
[info] # Warmup Iteration 4: 32370.404 ops/s
[info] # Warmup Iteration 5: 11278.130 ops/s
[info] Iteration 1: 3290.532 ops/s
[info] gc.alloc.rate: 15.960 MB/sec
[info] gc.alloc.rate.norm: 6066.932 B/op
[info] gc.count: 36.000 counts
[info] gc.time: 1541.000 ms
[info] Iteration 2: 2347.943 ops/s
[info] gc.alloc.rate: 13.090 MB/sec
[info] gc.alloc.rate.norm: 6566.901 B/op
[info] gc.count: 38.000 counts
[info] gc.time: 1644.000 ms
[info] Iteration 3: 4864.039 ops/s
[info] gc.alloc.rate: 27.263 MB/sec
[info] gc.alloc.rate.norm: 6985.689 B/op
[info] gc.count: 37.000 counts
[info] gc.time: 1568.000 ms
[info] Iteration 4: 5223.979 ops/s
[info] gc.alloc.rate: 28.205 MB/sec
[info] gc.alloc.rate.norm: 7294.109 B/op
[info] gc.count: 37.000 counts
[info] gc.time: 1607.000 ms
[info] Iteration 5: 3428.775 ops/s
[info] gc.alloc.rate: 15.688 MB/sec
[info] gc.alloc.rate.norm: 6146.888 B/op
[info] gc.count: 36.000 counts
[info] gc.time: 1581.000 ms
Here is how async-profiler's flame graph looks for that case:
@cowtowncoder would it be possible to consider changing the default JsonRecyclerPools to be the sharedLockFreePool implementation or possibly even to revert back to the threadLocal implementation? I think there will be too many users who never thought about reusing JsonFactory instances and/or ObjectMapper instances.
https://github.com/FasterXML/jackson-core/blob/5e109b2a52ec85b737f16071e31d925af8b07dbd/src/main/java/com/fasterxml/jackson/core/util/JsonRecyclerPools.java#L20-L30
While a revert to the previous default is possible, I would really like to understand the mechanics here.
Not reusing JsonFactory should not be catastrophic even if sub-optimal: it should be similar to just not having buffer recycling (pool instances are quite light-weight).
Conversely, not reusing ObjectMapper should be so inefficient that recycling via ThreadLocal could not make things fast no matter what; it should already have been woefully inefficient (much more so than not reusing JsonFactory).
So I feel like I am missing some piece of the puzzle here; something more seems to be going on.
@cowtowncoder the ThreadLocal buffer recycler instance is shared. It doesn't matter if users reuse ObjectMappers or not - they get a shared buffer recycler. With the new Jackson 2.17 default, unless a user knows about creating JsonFactories and reusing those - and lots of users have no idea about this - the ObjectMapper gets its own JsonFactory and its own buffer recycler instance. This buffer recycler instance will need to create new arrays. There is a good chance that with small docs few if any arrays will be reused.
I'm afraid I'm busy with setting up some new Apache projects so I don't really have any time to debug this.
@plokhotnyuk I notice that you changed the Json4s benchmark to try to use the ThreadLocal buffer recycler. Does this help to get the performance back to where it was before Jackson 2.17?
https://github.com/plokhotnyuk/jsoniter-scala/blob/58f2ef44fe6549bfeff60160d5094f2282d640ac/jsoniter-scala-benchmark/shared/src/main/scala/com/github/plokhotnyuk/jsoniter_scala/benchmark/Json4sFormats.scala#L166
@pjfanning Yes, I fully understand that the ThreadLocal-based pool is shared regardless of reuse of JsonFactory.
My point is that I am not convinced this reuse in itself would explain all of the observed difference, even for small content. It definitely would help in such a case.
Because... if ObjectMappers are not reused, performance can't be good -- even in the "least bad" case where ObjectMapper is only used to read JsonNode (mapper.readTree()), for smallish payloads throughput is at least 50% lower than with reuse.
So: yes, for the non-reuse case, the ThreadLocal pool will work better. But does that really explain the drastic difference? These are just simple byte[] and/or char[] instances, not heavy-weight objects or processing.
So I do not want to reflexively assume that this is the main or only problem here, revert, and call it good. Maybe tests by @plokhotnyuk can shed more light on this.
I am not against changing the default pool for 2.x version(s) if this seems necessary, FWIW.
@plokhotnyuk I was wondering if you were able to test out using
.recyclerPool(JsonRecyclerPools.threadLocalPool())
to configure the pool back to the pre-2.17 one?
Also: I was not clear on whether ObjectMapper instances were reused or not (or, if configuring at the JsonFactory level, JsonFactory instances)?
I am seriously considering switching the recycling pool default back to the ThreadLocal-based one for 2.17.1:
https://github.com/FasterXML/jackson-core/issues/1256
but would want to have confidence it would make a positive difference here.
Unfortunately I haven't really been able to decipher flame graphs to help have an idea of exact details of performance degradation.
One further data point: Trino issue https://github.com/trinodb/trino/issues/21356 does suggest that the issue would indeed be the retention of a huge number of buffers and not (I think) so much that there is no JsonFactory reuse -- how else would it be possible for a pool to keep on growing, unless somehow reuse failed (so a new buffer was allocated) but release to the pool succeeded?
This is what bothers me most, possibility of unbounded growth.
But how could this happen?
@plokhotnyuk I was wondering if you were able to test out using .recyclerPool(JsonRecyclerPools.threadLocalPool()) to configure the pool back to the pre-2.17 one?
Also: I was not clear on whether ObjectMapper instances were reused or not (or, if configuring at the JsonFactory level, JsonFactory instances)? I am seriously considering switching the recycling pool default back to the ThreadLocal-based one for 2.17.1, but would want to have confidence it would make a positive difference here.
Unfortunately I haven't really been able to decipher flame graphs to help have an idea of exact details of the performance degradation.
Currently all jackson-based benchmarks from the jsoniter-scala-benchmark module have been migrated to use JsonRecyclerPools.threadLocalPool() and one instance of ObjectMapper. That fixes the speed of parsing small messages using multiple threads.
No more 1000x performance regressions when running parsing or serialization using 16 threads. Here is a comparison of benchmark results using JDK 21 before the upgrade to 2.17.0 and after, with the mitigation applied; see jacksonScala, json4sJackson, playJson, weePickle.
@plokhotnyuk Ok, so it seems fair to say that switching back to the ThreadLocal recycler restored performance to the pre-2.17 state. Good. Thank you for confirming.
@plokhotnyuk Ok, based on investigation into reproducing the issue, there is some suspicious behavior by JsonRecyclerPools.newLockFreePool() -- oddly, it seems most likely to occur very early in a test run, and in my tests the situation also stabilizes. But there definitely is something, so I think it is necessary to use another implementation.
My thinking now is to revert back to threadLocalPool() for 2.17.x, but for 2.18 I think we'd switch the default to JsonRecyclerPools.newConcurrentDequePool().
Would it be possible to test what use of JsonRecyclerPools.newConcurrentDequePool() would do with your benchmark? It would be a good sanity check, as I haven't seen anomalous behavior with it in my test -- but my test's behavior itself is different from the benchmark here, so that's not sufficient proof.
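A small sketch of how the pool under test could be swapped for such a comparison; the recyclerPool system property and the wiring are assumptions for illustration, while the JsonRecyclerPools factory methods are real:

```scala
import com.fasterxml.jackson.core.JsonFactoryBuilder
import com.fasterxml.jackson.core.util.{BufferRecycler, JsonRecyclerPools, RecyclerPool}
import com.fasterxml.jackson.databind.json.JsonMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Pick the recycler pool implementation to benchmark via a (hypothetical) system property.
val pool: RecyclerPool[BufferRecycler] = sys.props.getOrElse("recyclerPool", "threadLocal") match {
  case "newConcurrentDeque"    => JsonRecyclerPools.newConcurrentDequePool()
  case "sharedConcurrentDeque" => JsonRecyclerPools.sharedConcurrentDequePool()
  case "nonRecycling"          => JsonRecyclerPools.nonRecyclingPool() // baseline without any recycling
  case _                       => JsonRecyclerPools.threadLocalPool()  // pre-2.17 default
}

val mapper = JsonMapper.builder(new JsonFactoryBuilder().recyclerPool(pool).build())
  .addModule(DefaultScalaModule)
  .build()
```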
@cowtowncoder I've prepared a branch with the pool type replacement for all JSON parsers that use jackson-core under the hood and will run the same benchmarks on the same environment this weekend: https://github.com/plokhotnyuk/jsoniter-scala/compare/master...new-concurrent-deque-pool
While the full tests on the desktop are still running, I tested JsonRecyclerPools.newConcurrentDequePool() for jackson-module-scala benchmarks on my laptop (Intel® Core™ i7-11800H CPU @ 2.3GHz (max 4.6GHz), 64 GB DDR4-3200 RAM) using 16 threads and async-profiler:
sbt -java-home /usr/lib/jvm/jdk-21 jsoniter-scala-benchmarkJVM/clean 'jsoniter-scala-benchmarkJVM/jmh:run -prof "async:dir=target/async-reports;interval=1000000;output=flamegraph;libPath=/opt/async-profiler/lib/libasyncProfiler.so" -jvmArgsAppend "-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints" --p size=128 -wi 5 -i 5 -t 16 jacksonScala'
For small (less than 100 bytes) messages up to 90% of CPU cycles are spent on acquiring, releasing and allocating byte buffers.
I've highlighted them in the following flamegraph:
I ran the same ADTWriting.jacksonScala test with a few different Jackson versions and tried a few different pool implementations with Jackson 2.17.0
- 2.17.1-SNAPSHOT already has some changes that cause a small slowdown in ADTWriting (with ThreadLocal buffer recycler).
- 2.18.0-SNAPSHOT is faster than 2.17.x and 2.16.x (when using ThreadLocal buffer recycler).
- Among the different pool implementations, ThreadLocal is faster than SharedConcurrentDeque which is faster than NewConcurrentDeque.
- I later ran with '-t 3' (3 threads). The pool order was the same but ThreadLocal was much better than the 2 ConcurrentDeque implementations with multiple threads.
sbt jsoniter-scala-benchmarkJVM/clean 'jsoniter-scala-benchmarkJVM/jmh:run -prof gc .*ADTWriting.jacksonScala'
2024-04-20 Default jsoniter-scala (Jackson 2.17.0 with ThreadLocal buffer recycler)
[info] Benchmark Mode Cnt Score Error Units
[info] ADTWriting.jacksonScala thrpt 5 2290252.179 ± 107995.529 ops/s
[info] ADTWriting.jacksonScala:gc.alloc.rate thrpt 5 1222.806 ± 57.697 MB/sec
[info] ADTWriting.jacksonScala:gc.alloc.rate.norm thrpt 5 560.000 ± 0.001 B/op
[info] ADTWriting.jacksonScala:gc.count thrpt 5 8.000 counts
[info] ADTWriting.jacksonScala:gc.time thrpt 5 7.000 ms
2024-04-20 jsoniter-scala changed to use Jackson 2.16.2 with ThreadLocal buffer recycler)
[info] Benchmark Mode Cnt Score Error Units
[info] ADTWriting.jacksonScala thrpt 5 2225989.506 ± 160118.600 ops/s
[info] ADTWriting.jacksonScala:gc.alloc.rate thrpt 5 1188.489 ± 85.712 MB/sec
[info] ADTWriting.jacksonScala:gc.alloc.rate.norm thrpt 5 560.000 ± 0.001 B/op
[info] ADTWriting.jacksonScala:gc.count thrpt 5 7.000 counts
[info] ADTWriting.jacksonScala:gc.time thrpt 5 7.000 ms
2024-04-20 jsoniter-scala changed to use Jackson 2.17.1-SNAPSHOT with ThreadLocal buffer recycler)
[info] Benchmark Mode Cnt Score Error Units
[info] ADTWriting.jacksonScala thrpt 5 2099028.720 ± 156226.482 ops/s
[info] ADTWriting.jacksonScala:gc.alloc.rate thrpt 5 1120.687 ± 83.382 MB/sec
[info] ADTWriting.jacksonScala:gc.alloc.rate.norm thrpt 5 560.000 ± 0.001 B/op
[info] ADTWriting.jacksonScala:gc.count thrpt 5 8.000 counts
[info] ADTWriting.jacksonScala:gc.time thrpt 5 7.000 ms
2024-04-20 Default jsoniter-scala (Jackson 2.17.0 with NewConcurrentDeque buffer recycler)
[info] Benchmark Mode Cnt Score Error Units
[info] ADTWriting.jacksonScala thrpt 5 1686594.004 ± 213058.648 ops/s
[info] ADTWriting.jacksonScala:gc.alloc.rate thrpt 5 1093.464 ± 137.984 MB/sec
[info] ADTWriting.jacksonScala:gc.alloc.rate.norm thrpt 5 680.000 ± 0.001 B/op
[info] ADTWriting.jacksonScala:gc.count thrpt 5 7.000 counts
[info] ADTWriting.jacksonScala:gc.time thrpt 5 5.000 ms
2024-04-20 Default jsoniter-scala (Jackson 2.17.0 with SharedConcurrentDeque buffer recycler)
[info] Benchmark Mode Cnt Score Error Units
[info] ADTWriting.jacksonScala thrpt 5 1981627.720 ± 242523.449 ops/s
[info] ADTWriting.jacksonScala:gc.alloc.rate thrpt 5 1103.358 ± 135.033 MB/sec
[info] ADTWriting.jacksonScala:gc.alloc.rate.norm thrpt 5 584.000 ± 0.001 B/op
[info] ADTWriting.jacksonScala:gc.count thrpt 5 7.000 counts
[info] ADTWriting.jacksonScala:gc.time thrpt 5 7.000 ms
@cowtowncoder @pjfanning The JsonRecyclerPools.newConcurrentDequePool() pool is too expensive for small (less than 1 KB) messages, especially in a multithreaded environment.
Here are comparisons of benchmark results between the threadLocalPool() and newConcurrentDequePool() mitigations for the recycler pool performance regression in jackson-core 2.17.0 on different JVMs using 16 threads (move the Ignoring deviations below 0% slider to 0 and scroll down to the Declined benchmarks section):
GraalVM JDK 22, GraalVM JDK 21, GraalVM JDK 17, GraalVM Community JDK 23, GraalVM Community JDK 21, JDK 23, JDK 21, JDK 17
@pjfanning Differences in your runs seem much more modest than what @plokhotnyuk reports. I wonder if that is due to the lower thread count (less contention) or to the output size being relatively bigger?
@plokhotnyuk I am still struggling to reconcile your results with understanding of what the specific reason for drastically lower performance is. It seems to me there are 3 possible causes:
- Contention (locking/blocking) of threads when acquiring and releasing buffers. I could see this being possible with really high thread counts and small content (because time between acquire/release calls is very short)
- Some implementation flaw that prevents successful reuse. If so, we'd see more allocations with concurrentDequePool implementation compared to thread local. And numbers would be similar to non-reusing "pool" implementation
- If there is no reuse of the underlying JsonFactory -- my understanding is this has been ruled out as a possibility.
If it is (1) that explains the problem, it would suggest there is not much to do, and the default for the traditional-threads case should remain ThreadLocalPool. And since Jackson should ideally not be in the business of auto-detecting the runtime platform (or users' thread-model choices), it probably should be THE default pool, only to be overridden by developers/frameworks that know better.
But if there is any chance (2) could be the case it'd be good to figure out why reuse was not working.
@pjfanning I am a bit surprised by there being any difference between "shared" and "new" pools -- when a single JsonFactory is reused there should be no difference at all (there's just a single pool used by the factory).
@plokhotnyuk Unfortunately I can't see the expanded flamegraph: the link to https://private-user-images.githubusercontent.com/890289/324165548-d40280ad-ed4d-4468-b4ab-c5f2de528fce.png shows up empty.
But on " 90% of CPU cycles are spent on acquiring, releasing and allocating byte buffers." (on 16 threads, with 100 byte write (or read?)): I would assume that "allocating" part should be negligible, since if not, it would mean that recycling did not actually work: with reused JsonFactory, pool size would eventually settle for larger number of buffers used and no new allocations should occur.
So all the overhead should be due to the other 2 things (acquiring and releasing), due to contention. Is this what you see?
If not I wonder how and why recycling itself did not work.
@plokhotnyuk Thank you for links! Unfortunately I am not quite sure how to read them, like say https://jmh.morethan.io/?sources=https://raw.githubusercontent.com/plokhotnyuk/jsoniter-scala/gh-pages/jdk-17-t16.json,https://raw.githubusercontent.com/plokhotnyuk/jsoniter-scala/gh-pages-jackson-recycler/jdk-17-t16.json (JDK 17). There are many entries, differences -- so what should I be looking at? And what is the baseline against which changes are?
@plokhotnyuk Thank you for links! Unfortunately I am not quite sure how to read them, like say https://jmh.morethan.io/?sources=https://raw.githubusercontent.com/plokhotnyuk/jsoniter-scala/gh-pages/jdk-17-t16.json,https://raw.githubusercontent.com/plokhotnyuk/jsoniter-scala/gh-pages-jackson-recycler/jdk-17-t16.json (JDK 17). There are many entries, differences -- so what should I be looking at? And what is the baseline against which changes are?
Both benchmark runs are for the latest jackson-core release.
The baseline is using the .recyclerPool(JsonRecyclerPools.threadLocalPool()) mitigation for all affected JSON parsers (jackson-module-scala, play-json, json4s, weePickle).
It is compared with the benchmark results for replacing that mitigation by .recyclerPool(JsonRecyclerPools.newConcurrentDequePool()), as committed here.
To see the up-to-20x slowdown shown in the screenshot below, please scroll down the page to the Declined Benchmarks section: