
Performance regression with 2.17.x

Open plokhotnyuk opened this issue 1 year ago • 9 comments

Both 2.17.0 and 2.17.0-rc1 are affected when small messages (less than 100 bytes) are parsed or serialized in multiple threads using the same object mapper instance.

The problem also exists when parsing bigger messages, but the performance difference is smaller.

All libraries that use the latest version of the Jackson library under the hood are affected: json4s, play-json, weePickle, etc.

Here are the results of benchmarks that parse and serialize 62-byte and 1185-byte messages (ADTReading/ADTWriting and ArrayOfEnumADTsReading/ArrayOfEnumADTsWriting respectively) on Intel(R) Core(TM) i7-11700 @ 2.50GHz (max 4.90 GHz) with DDR4-3200 using 16 threads:

2.16.2

[info] Benchmark                                           (size)   Mode  Cnt          Score         Error  Units
[info] ADTReading.borer                                       N/A  thrpt   10    6794141.947 ±   28943.355  ops/s
[info] ADTReading.circe                                       N/A  thrpt   10    5733802.213 ±   20142.997  ops/s
[info] ADTReading.circeJsoniter                               N/A  thrpt   10    6033807.062 ±   39206.180  ops/s
[info] ADTReading.jacksonScala                                N/A  thrpt   10    8600750.285 ±   29585.508  ops/s
[info] ADTReading.json4sJackson                               N/A  thrpt   10    1986617.545 ±   11937.239  ops/s
[info] ADTReading.json4sNative                                N/A  thrpt   10     728301.651 ±    3179.717  ops/s
[info] ADTReading.jsoniterScala                               N/A  thrpt   10   31056253.119 ±  162615.821  ops/s
[info] ADTReading.playJson                                    N/A  thrpt   10    2854140.465 ±   12044.125  ops/s
[info] ADTReading.playJsonJsoniter                            N/A  thrpt   10    5930774.513 ±   12228.253  ops/s
[info] ADTReading.smithy4sJson                                N/A  thrpt   10    8047920.054 ±   57498.071  ops/s
[info] ADTReading.sprayJson                                   N/A  thrpt   10    7559747.663 ±   37500.787  ops/s
[info] ADTReading.uPickle                                     N/A  thrpt   10   10820366.906 ±   38588.185  ops/s
[info] ADTReading.weePickle                                   N/A  thrpt   10    8520949.161 ±   38999.046  ops/s
[info] ADTReading.zioJson                                     N/A  thrpt   10    6516965.272 ±   30056.348  ops/s
[info] ADTWriting.borer                                       N/A  thrpt   10   12725701.358 ±   64383.508  ops/s
[info] ADTWriting.circe                                       N/A  thrpt   10    5702372.141 ±  123333.038  ops/s
[info] ADTWriting.circeJsoniter                               N/A  thrpt   10    6136183.702 ±   31911.935  ops/s
[info] ADTWriting.jacksonScala                                N/A  thrpt   10   20213415.218 ±  154415.084  ops/s
[info] ADTWriting.json4sJackson                               N/A  thrpt   10    1970762.580 ±    5888.709  ops/s
[info] ADTWriting.json4sNative                                N/A  thrpt   10    2129475.567 ±   10664.320  ops/s
[info] ADTWriting.jsoniterScala                               N/A  thrpt   10   86348750.403 ± 3923732.627  ops/s
[info] ADTWriting.jsoniterScalaPrealloc                       N/A  thrpt   10  126135815.779 ±  753875.520  ops/s
[info] ADTWriting.playJson                                    N/A  thrpt   10    7066647.975 ±   38110.136  ops/s
[info] ADTWriting.playJsonJsoniter                            N/A  thrpt   10   10034627.023 ±   45579.652  ops/s
[info] ADTWriting.smithy4sJson                                N/A  thrpt   10   19544469.284 ±  206715.422  ops/s
[info] ADTWriting.sprayJson                                   N/A  thrpt   10    8817363.222 ±   29355.150  ops/s
[info] ADTWriting.uPickle                                     N/A  thrpt   10   12346407.504 ±   41903.003  ops/s
[info] ADTWriting.weePickle                                   N/A  thrpt   10   15216077.687 ±   71823.446  ops/s
[info] ADTWriting.zioJson                                     N/A  thrpt   10    6696186.925 ±   37137.477  ops/s
[info] ArrayOfEnumADTsReading.borer                           128  thrpt   10    1485188.033 ±   11824.058  ops/s
[info] ArrayOfEnumADTsReading.circe                           128  thrpt   10    1060544.252 ±    5714.295  ops/s
[info] ArrayOfEnumADTsReading.circeJsoniter                   128  thrpt   10    1078942.307 ±    2906.939  ops/s
[info] ArrayOfEnumADTsReading.jacksonScala                    128  thrpt   10    1399993.550 ±    5345.969  ops/s
[info] ArrayOfEnumADTsReading.json4sJackson                   128  thrpt   10     386204.100 ±    1860.096  ops/s
[info] ArrayOfEnumADTsReading.json4sNative                    128  thrpt   10     312858.047 ±    3125.767  ops/s
[info] ArrayOfEnumADTsReading.jsoniterScala                   128  thrpt   10    4289813.648 ±   15066.858  ops/s
[info] ArrayOfEnumADTsReading.playJson                        128  thrpt   10     478973.090 ±    2376.637  ops/s
[info] ArrayOfEnumADTsReading.playJsonJsoniter                128  thrpt   10     515684.604 ±    1902.051  ops/s
[info] ArrayOfEnumADTsReading.sprayJson                       128  thrpt   10     755587.471 ±    4383.413  ops/s
[info] ArrayOfEnumADTsReading.uPickle                         128  thrpt   10    1242681.165 ±    2746.319  ops/s
[info] ArrayOfEnumADTsReading.weePickle                       128  thrpt   10    1169063.048 ±    4494.799  ops/s
[info] ArrayOfEnumADTsReading.zioJson                         128  thrpt   10     563521.111 ±    1445.949  ops/s
[info] ArrayOfEnumADTsWriting.borer                           128  thrpt   10    2054275.748 ±    9991.515  ops/s
[info] ArrayOfEnumADTsWriting.circe                           128  thrpt   10    1241981.451 ±  112432.709  ops/s
[info] ArrayOfEnumADTsWriting.circeJsoniter                   128  thrpt   10    2522632.787 ±   16045.668  ops/s
[info] ArrayOfEnumADTsWriting.jacksonScala                    128  thrpt   10    3174526.420 ±   17247.360  ops/s
[info] ArrayOfEnumADTsWriting.json4sJackson                   128  thrpt   10    1069988.662 ±    5451.828  ops/s
[info] ArrayOfEnumADTsWriting.json4sNative                    128  thrpt   10     387662.462 ±    1569.061  ops/s
[info] ArrayOfEnumADTsWriting.jsoniterScala                   128  thrpt   10   13584173.183 ±   65176.029  ops/s
[info] ArrayOfEnumADTsWriting.jsoniterScalaPrealloc           128  thrpt   10   15338486.834 ±   73143.637  ops/s
[info] ArrayOfEnumADTsWriting.playJson                        128  thrpt   10    2497807.549 ±   15178.768  ops/s
[info] ArrayOfEnumADTsWriting.playJsonJsoniter                128  thrpt   10    3063237.303 ±   25286.111  ops/s
[info] ArrayOfEnumADTsWriting.sprayJson                       128  thrpt   10    1816238.727 ±    8867.954  ops/s
[info] ArrayOfEnumADTsWriting.uPickle                         128  thrpt   10    1595086.586 ±    8474.119  ops/s
[info] ArrayOfEnumADTsWriting.weePickle                       128  thrpt   10    2396877.337 ±    8272.100  ops/s
[info] ArrayOfEnumADTsWriting.zioJson                         128  thrpt   10    2754854.478 ±   10817.622  ops/s

2.17.x

[info] Benchmark                                           (size)   Mode  Cnt          Score         Error  Units
[info] ADTReading.borer                                       N/A  thrpt   10    6808060.010 ±   24121.820  ops/s
[info] ADTReading.circe                                       N/A  thrpt   10    5504991.218 ±   28924.908  ops/s
[info] ADTReading.circeJsoniter                               N/A  thrpt   10    5965473.392 ±   24541.654  ops/s
[info] ADTReading.jacksonScala                                N/A  thrpt   10    3880829.241 ±   14779.470  ops/s
[info] ADTReading.json4sJackson                               N/A  thrpt   10    1684115.890 ±   17922.496  ops/s
[info] ADTReading.json4sNative                                N/A  thrpt   10     781783.122 ±    9209.704  ops/s
[info] ADTReading.jsoniterScala                               N/A  thrpt   10   28193676.278 ±  170866.780  ops/s
[info] ADTReading.playJson                                    N/A  thrpt   10    1861673.096 ±   32396.971  ops/s
[info] ADTReading.playJsonJsoniter                            N/A  thrpt   10    5936651.174 ±   28715.564  ops/s
[info] ADTReading.smithy4sJson                                N/A  thrpt   10    8183263.456 ±   78205.821  ops/s
[info] ADTReading.sprayJson                                   N/A  thrpt   10    7725596.379 ±   25148.636  ops/s
[info] ADTReading.uPickle                                     N/A  thrpt   10   10532982.352 ±   34026.274  ops/s
[info] ADTReading.weePickle                                   N/A  thrpt   10    3662160.317 ±   10805.963  ops/s
[info] ADTReading.zioJson                                     N/A  thrpt   10    6552279.264 ±   26565.487  ops/s
[info] ADTWriting.borer                                       N/A  thrpt   10   14804007.272 ±   66986.481  ops/s
[info] ADTWriting.circe                                       N/A  thrpt   10    5555829.627 ±   33717.558  ops/s
[info] ADTWriting.circeJsoniter                               N/A  thrpt   10    6289535.722 ±   24806.934  ops/s
[info] ADTWriting.jacksonScala                                N/A  thrpt   10    3815278.282 ±   44918.845  ops/s
[info] ADTWriting.json4sJackson                               N/A  thrpt   10       6895.294 ±    4815.743  ops/s
[info] ADTWriting.json4sNative                                N/A  thrpt   10    2123757.624 ±   10590.700  ops/s
[info] ADTWriting.jsoniterScala                               N/A  thrpt   10   82359727.429 ± 3453467.534  ops/s
[info] ADTWriting.jsoniterScalaPrealloc                       N/A  thrpt   10  134212123.633 ± 1162472.156  ops/s
[info] ADTWriting.playJson                                    N/A  thrpt   10     586570.259 ± 1309910.167  ops/s
[info] ADTWriting.playJsonJsoniter                            N/A  thrpt   10    9828872.088 ±   63874.635  ops/s
[info] ADTWriting.smithy4sJson                                N/A  thrpt   10   28487361.435 ±  130873.054  ops/s
[info] ADTWriting.sprayJson                                   N/A  thrpt   10   12135052.269 ±   49628.928  ops/s
[info] ADTWriting.uPickle                                     N/A  thrpt   10   10269863.568 ±   68308.485  ops/s
[info] ADTWriting.weePickle                                   N/A  thrpt   10    3263834.084 ±    8549.350  ops/s
[info] ADTWriting.zioJson                                     N/A  thrpt   10    6440516.792 ±   20713.803  ops/s
[info] ArrayOfEnumADTsReading.borer                           128  thrpt   10    1267431.783 ±    6201.075  ops/s
[info] ArrayOfEnumADTsReading.circe                           128  thrpt   10    1061247.664 ±    7164.785  ops/s
[info] ArrayOfEnumADTsReading.circeJsoniter                   128  thrpt   10    1079132.139 ±    5715.793  ops/s
[info] ArrayOfEnumADTsReading.jacksonScala                    128  thrpt   10    1396218.772 ±    6514.442  ops/s
[info] ArrayOfEnumADTsReading.json4sJackson                   128  thrpt   10     384574.257 ±    2804.130  ops/s
[info] ArrayOfEnumADTsReading.json4sNative                    128  thrpt   10     314105.580 ±    2377.774  ops/s
[info] ArrayOfEnumADTsReading.jsoniterScala                   128  thrpt   10    4304162.519 ±   14939.710  ops/s
[info] ArrayOfEnumADTsReading.playJson                        128  thrpt   10     394406.502 ±    1457.237  ops/s
[info] ArrayOfEnumADTsReading.playJsonJsoniter                128  thrpt   10     667298.395 ±    2485.742  ops/s
[info] ArrayOfEnumADTsReading.sprayJson                       128  thrpt   10     764851.351 ±    4757.110  ops/s
[info] ArrayOfEnumADTsReading.uPickle                         128  thrpt   10    1249665.477 ±    7025.275  ops/s
[info] ArrayOfEnumADTsReading.weePickle                       128  thrpt   10    1127465.787 ±    7466.347  ops/s
[info] ArrayOfEnumADTsReading.zioJson                         128  thrpt   10     563303.187 ±    2571.587  ops/s
[info] ArrayOfEnumADTsWriting.borer                           128  thrpt   10    2058803.540 ±   12165.516  ops/s
[info] ArrayOfEnumADTsWriting.circe                           128  thrpt   10    2011009.990 ±   12085.455  ops/s
[info] ArrayOfEnumADTsWriting.circeJsoniter                   128  thrpt   10    2667427.734 ±   13161.756  ops/s
[info] ArrayOfEnumADTsWriting.jacksonScala                    128  thrpt   10    2969747.276 ±   13888.207  ops/s
[info] ArrayOfEnumADTsWriting.json4sJackson                   128  thrpt   10     970464.315 ±    6924.823  ops/s
[info] ArrayOfEnumADTsWriting.json4sNative                    128  thrpt   10     388122.507 ±    1894.329  ops/s
[info] ArrayOfEnumADTsWriting.jsoniterScala                   128  thrpt   10   13694337.067 ±   54600.403  ops/s
[info] ArrayOfEnumADTsWriting.jsoniterScalaPrealloc           128  thrpt   10   15482028.886 ±   63118.277  ops/s
[info] ArrayOfEnumADTsWriting.playJson                        128  thrpt   10    2181855.381 ±   11481.359  ops/s
[info] ArrayOfEnumADTsWriting.playJsonJsoniter                128  thrpt   10    3063457.946 ±   22866.313  ops/s
[info] ArrayOfEnumADTsWriting.sprayJson                       128  thrpt   10    1822167.265 ±   11824.459  ops/s
[info] ArrayOfEnumADTsWriting.uPickle                         128  thrpt   10    2018705.050 ±   10173.067  ops/s
[info] ArrayOfEnumADTsWriting.weePickle                       128  thrpt   10    2164800.857 ±   22721.277  ops/s
[info] ArrayOfEnumADTsWriting.zioJson                         128  thrpt   10    2720774.650 ±   12951.514  ops/s

Steps to reproduce:

  1. Download binaries of async-profiler and unpack them into /opt/async-profiler
  2. git clone https://github.com/plokhotnyuk/jsoniter-scala
  3. cd jsoniter-scala
  4. sbt -java-home /usr/lib/jvm/jdk-21 jsoniter-scala-benchmarkJVM/clean 'jsoniter-scala-benchmarkJVM/jmh:run -prof "async:dir=target/async-reports;interval=1000000;output=flamegraph;libPath=/opt/async-profiler/lib/libasyncProfiler.so" -jvmArgsAppend "-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints" --p size=128 -t 16 -wi 10 -i 10 ADT'

The issue also exists when tested on other JVMs (GraalVM Community, Oracle GraalVM) and with other garbage collectors (G1GC, ZGC).

Here are flame-graphs of async-profiler reports for ADTWriting.jacksonScala benchmark:

2.16.2

image

2.17.0

image

For the json4s and play-json libraries, performance can progressively degrade by roughly 1000x:

[info] # Warmup Iteration   1: 498097.603 ops/s
[info] # Warmup Iteration   2: 1396602.529 ops/s
[info] # Warmup Iteration   3: 1390047.351 ops/s
[info] # Warmup Iteration   4: 1397460.036 ops/s
[info] # Warmup Iteration   5: 548693.538 ops/s
[info] # Warmup Iteration   6: 467155.766 ops/s
[info] # Warmup Iteration   7: 308437.551 ops/s
[info] # Warmup Iteration   8: 299814.921 ops/s
[info] # Warmup Iteration   9: 120140.861 ops/s
[info] # Warmup Iteration  10: 16015.216 ops/s
[info] Iteration   1: 6210.625 ops/s
[info] Iteration   2: 3199.072 ops/s
[info] Iteration   3: 9214.890 ops/s
[info] Iteration   4: 6565.308 ops/s
[info] Iteration   5: 1545.670 ops/s
[info] Iteration   6: 9597.542 ops/s
[info] Iteration   7: 6781.583 ops/s
[info] Iteration   8: 12569.237 ops/s
[info] Iteration   9: 5324.412 ops/s
[info] Iteration  10: 7944.598 ops/s

Could using a ThreadLocal for caching object mappers be an acceptable workaround for this issue?

Is there any workaround where we can explicitly define the initial size of the underlying buffers that would be suitable for use with other libraries like play-json and weePickle?

plokhotnyuk avatar Mar 13 '24 13:03 plokhotnyuk

Thanks @plokhotnyuk

The switch away from ThreadLocal caching is the biggest change in 2.17. It can be switched back on via config.

Fyi @cowtowncoder

pjfanning avatar Mar 13 '24 15:03 pjfanning

It can be switched back on via config.

@pjfanning Is it .recyclerPool(JsonRecyclerPools.threadLocalPool()) call for the factory builder?

plokhotnyuk avatar Mar 13 '24 15:03 plokhotnyuk

Interesting. My tests with jackson-benchmarks didn't show regression for rc1 (I think I did run tests) -- possibly since they do not collect output as byte[] (writeValueAsBytes()). This was with JDK 8 tho in case that matters.

cowtowncoder avatar Mar 13 '24 16:03 cowtowncoder

It can be switched back on via config.

@pjfanning Is it .recyclerPool(JsonRecyclerPools.threadLocalPool()) call for the factory builder?

@plokhotnyuk I'm on a day trip so only have my phone. That looks right though.

@cowtowncoder would it be fair to say that users should try to create as few ObjectMappers as possible and reuse them as much as possible? With the new default recycler, having a single ObjectMapper shared across threads might be the best option for high throughput.

pjfanning avatar Mar 13 '24 16:03 pjfanning

Yes, ObjectMappers should always be reused, regardless of other aspects: they (or rather, (de)serializer construction) are very heavy objects, and overhead of throwing them away negates any benefits of other caching, recycling.

And for safety, reusing ObjectReaders and ObjectWriters (constructed via ObjectMapper) is recommended: their creation is very cheap as well (as instances created by same ObjectMapper still share (de)serializer caching, reuse, in thread-safe manner).

Having said that, it sounds like here reuse is done as expected so wasteful non-reuse would not be an issue.

cowtowncoder avatar Mar 13 '24 18:03 cowtowncoder

FWIW, the json4s and play-json case is most worrying: it almost suggests that buffer recycling might be growing without bound? As if something was able to "return" buffers to recyclers, but not to actually recycle such buffer instances. Otherwise a lack of recycling in itself should not cause increasingly bad performance.

cowtowncoder avatar Mar 13 '24 18:03 cowtowncoder

And one more thing: I will be off for about a week so may not be able to follow up quickly here -- but I am very interested in learning what exactly might be going on to cause this problem, and how to resolve it.

cowtowncoder avatar Mar 13 '24 18:03 cowtowncoder

FWIW, the json4s and play-json case is most worrying: it almost suggests that buffer recycling might be growing without bound? As if something was able to "return" buffers to recyclers, but not to actually recycle such buffer instances. Otherwise a lack of recycling in itself should not cause increasingly bad performance.

I haven't looked at the play-json/json4s code but one possibility is that they create ObjectMappers and use them once and throw them away. They possibly don't use a JsonFactory instance to create the ObjectMappers. So every ObjectMapper could have its own JsonFactory and every JsonFactory has its own BufferRecycler instance - and this ends up hardly seeing any reuse of arrays.

With the old ThreadLocal buffer-recycler the cache is not specific to a JsonFactory instance - so you can get away with not trying to reuse the specific buffer-recycler instance.

pjfanning avatar Mar 13 '24 19:03 pjfanning

Yes, but that wouldn't explain gradually worsening behavior… it'd just be the overall no-recycling case. I guess it'd be possible to try with a no-op recycler and compare numbers.

cowtowncoder avatar Mar 13 '24 20:03 cowtowncoder

I fear that a lot of Jackson users just create lots of ObjectMappers and are unaware of JsonFactory. These users do not get optimal performance because they do not reuse ObjectMappers. In Jackson 2.16, they got ThreadLocal buffer recycling that spanned across all ObjectMappers. In Jackson 2.17, they will be getting new BufferRecycler instances for every ObjectMapper instance - and this will not be much better than switching off the BufferRecycler - especially with ObjectMappers that are used to parse or write just 1 small file.

pjfanning avatar Mar 14 '24 08:03 pjfanning

@cowtowncoder would it be possible to consider changing the default JsonRecyclerPools to be the sharedLockedFreePool implementation or possibly even to revert back to the threadLocal implementation? I think there will be too many users who never thought about reusing JsonFactory instances and/or ObjectMapper instances.

https://github.com/FasterXML/jackson-core/blob/5e109b2a52ec85b737f16071e31d925af8b07dbd/src/main/java/com/fasterxml/jackson/core/util/JsonRecyclerPools.java#L20-L30

pjfanning avatar Mar 14 '24 17:03 pjfanning

The strange thing happening with json4s and play-json: after 5-10 seconds of working under load, the allocation rate drops together with throughput, while GC time reaches 1.5 seconds for a 1-second iteration:

[info] # Benchmark: com.github.plokhotnyuk.jsoniter_scala.benchmark.ADTWriting.playJson
[info] # Run progress: 1.45% complete, ETA 04:17:51
[info] # Fork: 1 of 1
[info] # Warmup Iteration   1: 3013960.320 ops/s
[info] # Warmup Iteration   2: 789488.790 ops/s
[info] # Warmup Iteration   3: 139809.193 ops/s
[info] # Warmup Iteration   4: 32370.404 ops/s
[info] # Warmup Iteration   5: 11278.130 ops/s
[info] Iteration   1: 3290.532 ops/s
[info]                  gc.alloc.rate:      15.960 MB/sec
[info]                  gc.alloc.rate.norm: 6066.932 B/op
[info]                  gc.count:           36.000 counts
[info]                  gc.time:            1541.000 ms
[info] Iteration   2: 2347.943 ops/s
[info]                  gc.alloc.rate:      13.090 MB/sec
[info]                  gc.alloc.rate.norm: 6566.901 B/op
[info]                  gc.count:           38.000 counts
[info]                  gc.time:            1644.000 ms
[info] Iteration   3: 4864.039 ops/s
[info]                  gc.alloc.rate:      27.263 MB/sec
[info]                  gc.alloc.rate.norm: 6985.689 B/op
[info]                  gc.count:           37.000 counts
[info]                  gc.time:            1568.000 ms
[info] Iteration   4: 5223.979 ops/s
[info]                  gc.alloc.rate:      28.205 MB/sec
[info]                  gc.alloc.rate.norm: 7294.109 B/op
[info]                  gc.count:           37.000 counts
[info]                  gc.time:            1607.000 ms
[info] Iteration   5: 3428.775 ops/s
[info]                  gc.alloc.rate:      15.688 MB/sec
[info]                  gc.alloc.rate.norm: 6146.888 B/op
[info]                  gc.count:           36.000 counts
[info]                  gc.time:            1581.000 ms

Here is how async-profiler's flame graph looks for that case: image

plokhotnyuk avatar Mar 17 '24 08:03 plokhotnyuk

@cowtowncoder would it be possible to consider changing the default JsonRecyclerPools to be the sharedLockedFreePool implementation or possibly even to revert back to the threadLocal implementation? I think there will be too many users who never thought about reusing JsonFactory instances and/or ObjectMapper instances.

https://github.com/FasterXML/jackson-core/blob/5e109b2a52ec85b737f16071e31d925af8b07dbd/src/main/java/com/fasterxml/jackson/core/util/JsonRecyclerPools.java#L20-L30

While a revert to the previous default is possible, I would really like to understand the mechanics here. Not reusing JsonFactory should not be catastrophic even if sub-optimal: it should be similar to just not having buffer recycling (pool instances are quite light-weight). Conversely, not reusing ObjectMapper should be so inefficient that recycling via ThreadLocal could not make things fast no matter what; it should already have been woefully inefficient (much more so than not reusing JsonFactory).

So I feel like I am missing some piece of the puzzle here; something more seems to be going on.

cowtowncoder avatar Mar 21 '24 03:03 cowtowncoder

@cowtowncoder the ThreadLocal buffer recycler instance is shared. It doesn't matter if users reuse ObjectMappers or not - they get a shared buffer recycler. With the new Jackson 2.17 default, unless a user knows about creating JsonFactories and reusing those - and lots of users have no idea about this - the ObjectMapper gets its own JsonFactory and its own buffer recycler instance, which has to allocate new arrays. There is a good chance that, with small docs, few if any arrays will be reused.

I'm afraid I'm busy with setting up some new Apache projects so I don't really have any time to debug this.

pjfanning avatar Mar 21 '24 12:03 pjfanning

@plokhotnyuk I notice that you changed the Json4s benchmark to try to use the ThreadLocal buffer recycler. Does this help to get the performance back to where it was before Jackson 2.17?

https://github.com/plokhotnyuk/jsoniter-scala/blob/58f2ef44fe6549bfeff60160d5094f2282d640ac/jsoniter-scala-benchmark/shared/src/main/scala/com/github/plokhotnyuk/jsoniter_scala/benchmark/Json4sFormats.scala#L166

pjfanning avatar Mar 21 '24 12:03 pjfanning

@pjfanning Yes I fully understand ThreadLocal-based pool is shared regardless of reuse of JsonFactory.

My point is that I am not convinced this reuse in itself would explain all of the observed difference, even for small content. It definitely would help in such a case. Because... if ObjectMappers are not reused, performance can't be good -- even in the "least bad" case where ObjectMapper is only used to read JsonNode (mapper.readTree()), for smallish payloads throughput is at least 50% lower than with reuse.

So: yes, for the non-reuse case, the ThreadLocal pool will work better. But does that really explain the drastic difference? These are just simple byte[] and/or char[] instances, not heavy-weight objects or processing.

So I do not want to reflexively assume that this is the main or only problem here, revert, call it good. Maybe tests by @plokhotnyuk can shed more light on this.

I am not against changing the default pool for 2.x version(s) if this seems necessary, fwiw.

cowtowncoder avatar Mar 21 '24 17:03 cowtowncoder

@plokhotnyuk I was wondering if you were able to test out using

.recyclerPool(JsonRecyclerPools.threadLocalPool())

to configure pool back to pre-2.17 one?

Also: I was not clear on whether ObjectMapper instances were reused or not (or, if using JsonFactory level, JsonFactory instances)? I am seriously considering switching recycling pool default back to ThreadLocal-based one for 2.17.1:

https://github.com/FasterXML/jackson-core/issues/1256

but would want to have confidence it would make positive difference here.

Unfortunately I haven't really been able to decipher flame graphs to help have an idea of exact details of performance degradation.

cowtowncoder avatar Apr 14 '24 19:04 cowtowncoder

One further data point: Trino issue https://github.com/trinodb/trino/issues/21356 does suggest that the issue would indeed be the retaining of a huge number of buffers and not (I think) so much that there is no JsonFactory reuse -- how else would it be possible for a Pool to keep on growing, unless somehow reuse failed (so a new buffer was allocated) but release to the pool succeeded? This is what bothers me most, the possibility of unbounded growth.

But how could this happen?

cowtowncoder avatar Apr 14 '24 19:04 cowtowncoder

@plokhotnyuk I was wondering if you were able to test out using

.recyclerPool(JsonRecyclerPools.threadLocalPool())

to configure pool back to pre-2.17 one?

Also: I was not clear on whether ObjectMapper instances were reused or not (or, if using JsonFactory level, JsonFactory instances)? I am seriously considering switching recycling pool default back to ThreadLocal-based one for 2.17.1:

FasterXML/jackson-core#1256

but would want to have confidence it would make positive difference here.

Unfortunately I haven't really been able to decipher flame graphs to help have an idea of exact details of performance degradation.

Currently all Jackson-based benchmarks in the jsoniter-scala-benchmark module have been migrated to use JsonRecyclerPools.threadLocalPool() and one instance of ObjectMapper. That fixes the speed of small-message parsing using multiple threads.

No more 1000x performance regressions when running parsing or serialization using 16 threads. Here is a comparison of benchmark results using JDK 21 before the upgrade to 2.17.0 and after, with the mitigation applied; see jacksonScala, json4sJackson, playJson, weePickle.

plokhotnyuk avatar Apr 15 '24 08:04 plokhotnyuk

@plokhotnyuk Ok so it seems fair to say that switching back to the ThreadLocal recycler restored performance to its pre-2.17 state. Good. Thank you for confirming.

cowtowncoder avatar Apr 16 '24 03:04 cowtowncoder

@plokhotnyuk Ok, based on investigation into reproducing the issue, there is some suspicious behavior by JsonRecyclerPools.newLockFreePool() -- oddly, it seems most likely to occur very early in a test run, and in my tests the situation also stabilizes. But there definitely is something. So I think it is necessary to use another implementation.

My thinking now is to revert back to threadLocalPool() for 2.17.x, but for 2.18 I think we'd switch default to JsonRecyclerPools.newConcurrentDequePool().

Would it be possible to test what use of JsonRecyclerPools.newConcurrentDequePool() would do with your benchmark? It would be a good sanity check, as I haven't seen anomalous behavior with it in my test -- but my test's behavior itself is different from the benchmark here, so that's not sufficient proof.

cowtowncoder avatar Apr 19 '24 03:04 cowtowncoder

@cowtowncoder I've prepared a branch with the pool type replacement for all JSON parsers that use jackson-core under the hood and will run same benchmarks on the same environment this weekend: https://github.com/plokhotnyuk/jsoniter-scala/compare/master...new-concurrent-deque-pool

plokhotnyuk avatar Apr 19 '24 06:04 plokhotnyuk

While the full tests on the desktop are still running, I tested JsonRecyclerPools.newConcurrentDequePool() for jackson-module-scala benchmarks on my laptop with an Intel® Core™ i7-11800H CPU @ 2.3GHz (max 4.6GHz) and 64GB DDR4-3200 RAM, using 16 threads and async-profiler:

sbt -java-home /usr/lib/jvm/jdk-21 jsoniter-scala-benchmarkJVM/clean 'jsoniter-scala-benchmarkJVM/jmh:run -prof "async:dir=target/async-reports;interval=1000000;output=flamegraph;libPath=/opt/async-profiler/lib/libasyncProfiler.so" -jvmArgsAppend "-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints" --p size=128 -wi 5 -i 5 -t 16 jacksonScala'

For small (less than 100 bytes) messages up to 90% of CPU cycles are spent on acquiring, releasing and allocating byte buffers.

I've highlighted them in the following flamegraph:

image

plokhotnyuk avatar Apr 20 '24 08:04 plokhotnyuk

I ran the same ADTWriting.jacksonScala test with a few different Jackson versions and tried a few different pool implementations with Jackson 2.17.0

  • 2.17.1-SNAPSHOT already has some changes that cause a small slowdown in ADTWriting (with ThreadLocal buffer recycler).
  • 2.18.0-SNAPSHOT is faster than 2.17.x and 2.16.x (when using ThreadLocal buffer recycler).
  • Among the different pool implementations, ThreadLocal is faster than SharedConcurrentDeque which is faster than NewConcurrentDeque.
  • I later ran with '-t 3' (3 threads). The pool order was the same but ThreadLocal was much better than the 2 ConcurrentDeque implementations with multiple threads.
sbt jsoniter-scala-benchmarkJVM/clean 'jsoniter-scala-benchmarkJVM/jmh:run -prof gc .*ADTWriting.jacksonScala'

2024-04-20 Default jsoniter-scala (Jackson 2.17.0 with ThreadLocal buffer recycler)

[info] Benchmark                                    Mode  Cnt        Score        Error   Units
[info] ADTWriting.jacksonScala                     thrpt    5  2290252.179 ± 107995.529   ops/s
[info] ADTWriting.jacksonScala:gc.alloc.rate       thrpt    5     1222.806 ±     57.697  MB/sec
[info] ADTWriting.jacksonScala:gc.alloc.rate.norm  thrpt    5      560.000 ±      0.001    B/op
[info] ADTWriting.jacksonScala:gc.count            thrpt    5        8.000               counts
[info] ADTWriting.jacksonScala:gc.time             thrpt    5        7.000  

2024-04-20 jsoniter-scala changed to use Jackson 2.16.2 (with ThreadLocal buffer recycler)

[info] Benchmark                                    Mode  Cnt        Score        Error   Units
[info] ADTWriting.jacksonScala                     thrpt    5  2225989.506 ± 160118.600   ops/s
[info] ADTWriting.jacksonScala:gc.alloc.rate       thrpt    5     1188.489 ±     85.712  MB/sec
[info] ADTWriting.jacksonScala:gc.alloc.rate.norm  thrpt    5      560.000 ±      0.001    B/op
[info] ADTWriting.jacksonScala:gc.count            thrpt    5        7.000               counts
[info] ADTWriting.jacksonScala:gc.time             thrpt    5        7.000  

2024-04-20 jsoniter-scala changed to use Jackson 2.17.1-SNAPSHOT (with ThreadLocal buffer recycler)

[info] Benchmark                                    Mode  Cnt        Score        Error   Units
[info] ADTWriting.jacksonScala                     thrpt    5  2099028.720 ± 156226.482   ops/s
[info] ADTWriting.jacksonScala:gc.alloc.rate       thrpt    5     1120.687 ±     83.382  MB/sec
[info] ADTWriting.jacksonScala:gc.alloc.rate.norm  thrpt    5      560.000 ±      0.001    B/op
[info] ADTWriting.jacksonScala:gc.count            thrpt    5        8.000               counts
[info] ADTWriting.jacksonScala:gc.time             thrpt    5        7.000  

2024-04-20 Default jsoniter-scala (Jackson 2.17.0 with NewConcurrentDeque buffer recycler)

[info] Benchmark                                    Mode  Cnt        Score        Error   Units
[info] ADTWriting.jacksonScala                     thrpt    5  1686594.004 ± 213058.648   ops/s
[info] ADTWriting.jacksonScala:gc.alloc.rate       thrpt    5     1093.464 ±    137.984  MB/sec
[info] ADTWriting.jacksonScala:gc.alloc.rate.norm  thrpt    5      680.000 ±      0.001    B/op
[info] ADTWriting.jacksonScala:gc.count            thrpt    5        7.000               counts
[info] ADTWriting.jacksonScala:gc.time             thrpt    5        5.000 

2024-04-20 Default jsoniter-scala (Jackson 2.17.0 with SharedConcurrentDeque buffer recycler)

[info] Benchmark                                    Mode  Cnt        Score        Error   Units
[info] ADTWriting.jacksonScala                     thrpt    5  1981627.720 ± 242523.449   ops/s
[info] ADTWriting.jacksonScala:gc.alloc.rate       thrpt    5     1103.358 ±    135.033  MB/sec
[info] ADTWriting.jacksonScala:gc.alloc.rate.norm  thrpt    5      584.000 ±      0.001    B/op
[info] ADTWriting.jacksonScala:gc.count            thrpt    5        7.000               counts
[info] ADTWriting.jacksonScala:gc.time             thrpt    5        7.000  

pjfanning avatar Apr 20 '24 18:04 pjfanning

@cowtowncoder @pjfanning The JsonRecyclerPools.newConcurrentDequePool() pool is too expensive for small (less than 1Kb) messages, especially in a multithreaded environment.

Here are comparisons of benchmark results between the threadLocalPool() and newConcurrentDequePool() mitigations for the recycler pool performance regression in jackson-core 2.17.0, on different JVMs using 16 threads (move the "Ignoring deviations below 0%" slider to 0 and scroll down to the Declined benchmarks section):

GraalVM JDK 22, GraalVM JDK 21, GraalVM JDK 17, GraalVM Community JDK 23, GraalVM Community JDK 21, JDK 23, JDK 21, JDK 17

plokhotnyuk avatar Apr 24 '24 07:04 plokhotnyuk

@pjfanning Differences in your runs seem much more modest than what @plokhotnyuk reports. I wonder if that is due to the lower thread count (less contention) or to the output size being relatively bigger?

@plokhotnyuk I am still struggling to reconcile your results with an understanding of what the specific reason for the drastically lower performance is. It seems to me there are 3 possible causes:

  1. Contention (locking/blocking) of threads when acquiring and releasing buffers. I could see this being possible with really high thread counts and small content (because time between acquire/release calls is very short)
  2. Some implementation flaw that prevents successful reuse. If so, we'd see more allocations with concurrentDequePool implementation compared to thread local. And numbers would be similar to non-reusing "pool" implementation
  3. If there is no reuse of the underlying JsonFactory -- my understanding is this has been ruled out as a possibility.

If it is (1) that explains the problem, it would suggest there is not much to do, and the default for the traditional-threads case should remain ThreadLocalPool. And since Jackson should ideally not be in the business of auto-detecting the runtime platform (or users' thread-model choices), it probably should be THE default pool, only to be overridden by developers/frameworks that know better.

But if there is any chance (2) could be the case it'd be good to figure out why reuse was not working.

cowtowncoder avatar Apr 26 '24 04:04 cowtowncoder

@pjfanning I am a bit surprised by there being any difference between "shared" and "new" pools -- when a single JsonFactory is reused there should be no difference at all (there's just a single pool used by the factory).

cowtowncoder avatar Apr 26 '24 04:04 cowtowncoder

@plokhotnyuk Unfortunately I can't see the expanded flamegraph: the linked image shows up empty.

But on " 90% of CPU cycles are spent on acquiring, releasing and allocating byte buffers." (on 16 threads, with 100 byte write (or read?)): I would assume that "allocating" part should be negligible, since if not, it would mean that recycling did not actually work: with reused JsonFactory, pool size would eventually settle for larger number of buffers used and no new allocations should occur. So all the overhead should be due other 2 things (acquiring and releasing), due to contention. Is this what you see?

If not I wonder how and why recycling itself did not work.

cowtowncoder avatar Apr 26 '24 04:04 cowtowncoder

@plokhotnyuk Thank you for the links! Unfortunately I am not quite sure how to read them, like say https://jmh.morethan.io/?sources=https://raw.githubusercontent.com/plokhotnyuk/jsoniter-scala/gh-pages/jdk-17-t16.json,https://raw.githubusercontent.com/plokhotnyuk/jsoniter-scala/gh-pages-jackson-recycler/jdk-17-t16.json (JDK 17). There are many entries, differences -- so what should I be looking at? And what is the baseline against which the changes are measured?

cowtowncoder avatar Apr 27 '24 21:04 cowtowncoder

@plokhotnyuk Thank you for the links! Unfortunately I am not quite sure how to read them, like say https://jmh.morethan.io/?sources=https://raw.githubusercontent.com/plokhotnyuk/jsoniter-scala/gh-pages/jdk-17-t16.json,https://raw.githubusercontent.com/plokhotnyuk/jsoniter-scala/gh-pages-jackson-recycler/jdk-17-t16.json (JDK 17). There are many entries, differences -- so what should I be looking at? And what is the baseline against which the changes are measured?

Both benchmark runs are for the latest jackson-core release.

The baseline is using .recyclerPool(JsonRecyclerPools.threadLocalPool()) mitigation for all affected JSON parsers (jackson-module-scala, play-json, json4s, weePickle).

It is compared with benchmark results for replacing that mitigation by .recyclerPool(JsonRecyclerPools.newConcurrentDequePool()) as committed here

To see the up-to-20x slowdown shown on the screenshot below, please scroll down the page to the Declined Benchmarks section:

image

plokhotnyuk avatar Apr 30 '24 13:04 plokhotnyuk