cats-effect
cats-effect copied to clipboard
[3.3.0] Performance regression
This issue was originally created in fs2, so version numbers below refer to fs2.
I've observed a performance degradation in some scenarios on 3.2.3 version.
The throughput on small byte streams remained the same.
The throughput on bigger byte streams is decreased roughly by ~20%.
The memory allocation decreased by 10-15% on bigger byte streams.
See the benchmarks below for more details.
Stream usage
The project utilizes a TCP socket from the fs2-io module. I cannot share many details due to NDA, but the generalized usage of the socket is the following one:
val streamDecoder: StreamDecoder[Structure] =
StreamDecoder.many(StructureDecoder)
socket.reads
.through(streamDecoder.toPipeByte)
.evalMap(structure => queue.offer(Right(structure)))
.compile
.background
Consumed bytes per single invocation:
- createOne: 167 bytes
- returnRandomUUID: 135 bytes
- return100Record: 2683 bytes
3.2.2
Operation average time:
Benchmark Mode Cnt Score Error Units
DriverBenchmark.createOne avgt 25 4.725 ± 0.396 ms/op
DriverBenchmark.return100Records avgt 25 8.613 ± 0.488 ms/op
DriverBenchmark.returnRandomUUID avgt 25 3.542 ± 1.156 ms/op
Memory allocation:
DriverBenchmark.return100Records:·gc.alloc.rate 201.323 ± 11.576 MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm 3117144.635 ± 13036.746 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space 199.083 ± 23.093 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm 3076695.652 ± 283692.848 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen 0.074 ± 0.113 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm 1266.657 ± 1981.461 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space 0.484 ± 0.444 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm 7614.493 ± 6998.312 B/op
DriverBenchmark.return100Records:·gc.count 72.000 counts
DriverBenchmark.return100Records:·gc.time 300.000 ms
3.2.3
Operation average time:
Benchmark Mode Cnt Score Error Units
DriverBenchmark.createOne avgt 25 4.862 ± 0.414 ms/op
DriverBenchmark.return100Records avgt 25 11.008 ± 0.356 ms/op
DriverBenchmark.returnRandomUUID avgt 25 3.068 ± 0.299 ms/op
Memory allocation:
DriverBenchmark.return100Records:·gc.alloc.rate 169.961 ± 6.376 MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm 3383113.516 ± 12432.017 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space 175.288 ± 24.073 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm 3484911.125 ± 432781.648 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen 0.061 ± 0.075 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm 1224.015 ± 1529.533 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space 0.484 ± 0.566 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm 9489.291 ± 11071.370 B/op
DriverBenchmark.return100Records:·gc.count 56.000 counts
DriverBenchmark.return100Records:·gc.time 433.000 ms
I wonder if this is related to https://github.com/typelevel/fs2/pull/2734. Would you mind trying with these versions?
before: 3.2-111-594f8fe
after: 3.2-124-832c161
@armanbilge I was thinking about this change as well.
Sure, I will rerun benchmarks against these versions.
@armanbilge not much of a difference 🤔
3.2-111-594f8fe
Benchmark Mode Cnt Score Error Units
DriverBenchmark.createOne avgt 25 4.753 ± 0.383 ms/op
DriverBenchmark.return100Records avgt 25 10.699 ± 0.411 ms/op
DriverBenchmark.returnRandomUUID avgt 25 3.078 ± 0.283 ms/op
3.2-124-832c161
Benchmark Mode Cnt Score Error Units
DriverBenchmark.createOne avgt 25 4.794 ± 0.494 ms/op
DriverBenchmark.return100Records avgt 25 10.758 ± 0.408 ms/op
DriverBenchmark.returnRandomUUID avgt 25 3.065 ± 0.281 ms/op
3.2.3
Benchmark Mode Cnt Score Error Units
DriverBenchmark.createOne avgt 25 4.838 ± 0.420 ms/op
DriverBenchmark.return100Records avgt 25 11.064 ± 0.435 ms/op
DriverBenchmark.returnRandomUUID avgt 25 3.119 ± 0.326 ms/op
If you have bandwidth to bisect further, you can use the --first-parent option to get only merge commits and use the hash to look up the version in maven:
https://repo1.maven.org/maven2/co/fs2/fs2-io_2.13/
@armanbilge I will experiment with different versions during the day
Thanks!
Nothing really jumps out when looking at the diff between v3.2.2 and 594f8fe. Most likely culprits:
- https://github.com/typelevel/fs2/pull/2720
- Upgrade to CE 3.3
Indeed, my project depends on cats-effect-std, hence IO comes from the fs2 package.
fs2 3.2.2 with cats-effect 3.3.0 shows the same throughput as 3.2.3.
Thank you for your help and sorry for wasting your time.
Not a waste of time at all, thanks for reporting! It would be good to understand what's causing the slow down in cats-effect. My guess is that this is the overhead of tracing but that's a total guess.
Cats Effect 3.0.0 did not have tracing. It came with 3.2.0. The latest fs2 3.2.3 should be on CE 3.3.0.
Can you try the benchmark with the system property -Dcats.effect.tracing.mode=none? More details here: https://typelevel.org/cats-effect/docs/tracing#configuration
I'm running benchmarks on a different machine so numbers differ from yesterday.
fs2 3.2.2
cats-effect 3.2.9 without tracing
Benchmark Mode Cnt Score Error Units
DriverBenchmark.return100Records avgt 25 7.010 ± 0.337 ms/op
DriverBenchmark.return100Records:·gc.alloc.rate avgt 25 246.768 ± 12.250 MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm avgt 25 3068430.610 ± 10566.046 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space avgt 25 245.038 ± 18.895 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm avgt 25 3051556.639 ± 232239.895 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen avgt 25 0.052 ± 0.101 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm avgt 25 705.673 ± 1401.404 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space avgt 25 0.419 ± 0.463 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm avgt 25 5707.785 ± 6392.072 B/op
DriverBenchmark.return100Records:·gc.count avgt 25 80.000 counts
DriverBenchmark.return100Records:·gc.time avgt 25 275.000 ms
DriverBenchmark.return100Records:·stack avgt NaN ---
cats-effect 3.2.9 with tracing
Benchmark Mode Cnt Score Error Units
DriverBenchmark.return100Records avgt 25 7.986 ± 0.315 ms/op
DriverBenchmark.return100Records:·gc.alloc.rate avgt 25 217.543 ± 9.826 MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm avgt 25 3097923.212 ± 9261.729 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space avgt 25 215.997 ± 25.889 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm avgt 25 3074210.574 ± 334508.436 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen avgt 25 0.031 ± 0.037 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm avgt 25 454.083 ± 576.467 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space avgt 25 0.467 ± 0.715 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm avgt 25 6855.137 ± 10494.727 B/op
DriverBenchmark.return100Records:·gc.count avgt 25 77.000 counts
DriverBenchmark.return100Records:·gc.time avgt 25 282.000 ms
cats-effect 3.3.0 without tracing
Benchmark Mode Cnt Score Error Units
DriverBenchmark.return100Records avgt 25 7.792 ± 0.323 ms/op
DriverBenchmark.return100Records:·gc.alloc.rate avgt 25 220.893 ± 9.084 MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm avgt 25 3065325.686 ± 21557.399 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space avgt 25 221.037 ± 22.768 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm avgt 25 3065786.971 ± 270556.884 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen avgt 25 0.063 ± 0.116 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm avgt 25 943.848 ± 1749.708 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space avgt 25 0.421 ± 0.573 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm avgt 25 6075.865 ± 8203.016 B/op
DriverBenchmark.return100Records:·gc.count avgt 25 80.000 counts
DriverBenchmark.return100Records:·gc.time avgt 25 281.000 ms
cats-effect 3.3.0 with tracing
Benchmark Mode Cnt Score Error Units
DriverBenchmark.return100Records avgt 25 10.010 ± 0.504 ms/op
DriverBenchmark.return100Records:·gc.alloc.rate avgt 25 178.274 ± 9.522 MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm avgt 25 3185210.074 ± 13694.388 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space avgt 25 176.030 ± 26.087 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm avgt 25 3143461.596 ± 418077.571 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen avgt 25 0.051 ± 0.084 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm avgt 25 910.347 ± 1460.952 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space avgt 25 0.488 ± 0.541 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm avgt 25 8868.587 ± 9979.185 B/op
DriverBenchmark.return100Records:·gc.count avgt 25 58.000 counts
DriverBenchmark.return100Records:·gc.time avgt 25 408.000 ms
DriverBenchmark.return100Records:·stack avgt NaN ---
fs2 3.2.3
cats-effect 3.2.9 without tracing
Benchmark Mode Cnt Score Error Units
DriverBenchmark.return100Records avgt 25 7.325 ± 0.355 ms/op
DriverBenchmark.return100Records:·gc.alloc.rate avgt 25 249.123 ± 12.370 MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm avgt 25 3244986.275 ± 11358.509 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space avgt 25 251.110 ± 25.120 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm avgt 25 3270291.158 ± 288189.174 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen avgt 25 0.036 ± 0.072 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm avgt 25 513.287 ± 1066.717 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space avgt 25 0.371 ± 0.447 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm avgt 25 5275.698 ± 6492.410 B/op
DriverBenchmark.return100Records:·gc.count avgt 25 87.000 counts
DriverBenchmark.return100Records:·gc.time avgt 25 255.000 ms
cats-effect 3.2.9 with tracing
Benchmark Mode Cnt Score Error Units
DriverBenchmark.return100Records avgt 25 8.001 ± 0.343 ms/op
DriverBenchmark.return100Records:·gc.alloc.rate avgt 25 217.396 ± 9.389 MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm avgt 25 3197155.595 ± 16900.517 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space avgt 25 218.988 ± 24.045 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm avgt 25 3362897.380 ± 395288.166 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen avgt 25 0.027 ± 0.035 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm avgt 25 437.723 ± 581.754 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space avgt 25 0.467 ± 0.715 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm avgt 25 7273.963 ± 11126.741 B/op
DriverBenchmark.return100Records:·gc.count avgt 25 77.000 counts
DriverBenchmark.return100Records:·gc.time avgt 25 299.000 ms
cats-effect 3.3.0 without tracing
Benchmark Mode Cnt Score Error Units
DriverBenchmark.return100Records avgt 25 7.662 ± 0.310 ms/op
DriverBenchmark.return100Records:·gc.alloc.rate avgt 25 237.331 ± 10.179 MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm avgt 25 3219401.063 ± 15108.885 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space avgt 25 241.757 ± 23.280 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm avgt 25 3283962.366 ± 312667.926 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen avgt 25 0.024 ± 0.030 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm avgt 25 346.730 ± 441.142 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space avgt 25 0.350 ± 0.505 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm avgt 25 5005.121 ± 7250.275 B/op
DriverBenchmark.return100Records:·gc.count avgt 25 80.000 counts
DriverBenchmark.return100Records:·gc.time avgt 25 284.000 ms
cats-effect 3.3.0 with tracing
Benchmark Mode Cnt Score Error Units
DriverBenchmark.return100Records avgt 25 10.424 ± 0.391 ms/op
DriverBenchmark.return100Records:·gc.alloc.rate avgt 25 182.919 ± 6.906 MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm avgt 25 3410284.489 ± 39986.912 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space avgt 25 182.465 ± 26.496 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm avgt 25 3402766.732 ± 477752.507 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen avgt 25 0.143 ± 0.145 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm avgt 25 2694.292 ± 2722.628 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space avgt 25 0.768 ± 0.501 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm avgt 25 14438.726 ± 9527.695 B/op
DriverBenchmark.return100Records:·gc.count avgt 25 61.000 counts
DriverBenchmark.return100Records:·gc.time avgt 25 409.000 ms
Thanks for the update!
fs2 3.2.2 with cats-effect 3.3.0 shows the same throughput as 3.2.3.
~~Can you repeat this experiment with tracing disabled?~~
Sorry, actually no. I think you want to compare fs2 3.2.2 with CE 3.2.9 and CE 3.3.0, both with tracing disabled.
@armanbilge I've updated the comment above
@iRevive Thanks! For the fs2 3.2.2 / CE 3.2.9, I assume that was with tracing? Can you run those with tracing disabled too? (same option)
@armanbilge yeah, the difference is caused by tracing.
I've updated my comment with extra benchmarks. Seems I covered all possible combinations.
Thanks for all your efforts, and sorry for the confusion. I think these are the 4 combinations we need:
- [ ] fs2 3.2.2 / CE 3.2.9 / tracing enabled
- [ ] fs2 3.2.2 / CE 3.2.9 / tracing disabled
- [x] fs2 3.2.2 / CE 3.3.0 / tracing enabled
- [x] fs2 3.2.2 / CE 3.3.0 / tracing disabled
You've done one of the first two, but I'm not sure which (I assume with tracing enabled, because that's the default).
Sorry to be a nuisance. Running CE 3.2.9 with tracing disabled is helpful to check if the performance change was caused due to changes in tracing or changes in the runtime itself.
@armanbilge done, see the updated comment https://github.com/typelevel/fs2/issues/2746#issuecomment-988800286
Awesome, thank you! 😁 let's have a look..
Looking at the time Delta, this seems about in line with what I would expect for tracing code which is mostly compute bound: about a 25% difference. Fully compute bound would be about 30%, more than likely.
The really interesting thing here is that GC time though. The tracing benchmark hit fewer GC iterations but still took almost twice as long. I wonder if that means that we could optimize tracing a bit further in practice by streamlining GC costs?
Edit: actually that shift only happened in 3.3.0, so it's almost certainly being caused by the weak bag shenanigans. We're forcing the GC to work harder in order to avoid bogging down the critical path.
Benchmark 3.2.9- 3.2.9+ 3.3.0- 3.3.0+ 3.2.9- 3.2.9+ 3.3.0- 3.3.0+
DriverBenchmark.return100Records 7.010 7.986 7.792 10.010 7.325 8.001 7.662 10.424
DriverBenchmark.return100Records:·gc.alloc.rate 246.768 217.543 220.893 178.274 249.123 217.396 237.331 182.919
DriverBenchmark.return100Records:·gc.alloc.rate.norm 3068430.610 3097923.212 3065325.686 3185210.074 3244986.275 3197155.595 3219401.063 3410284.489
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space 245.038 215.997 221.037 176.030 251.110 218.988 241.757 182.465
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm 3051556.639 3074210.574 3065786.971 3143461.596 3270291.158 3362897.380 3283962.366 3402766.732
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen 0.052 0.031 0.063 0.051 0.036 0.027 0.024 0.143
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm 705.673 454.083 943.848 910.347 513.287 437.723 346.730 2694.292
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space 0.419 0.467 0.421 0.488 0.371 0.467 0.350 0.768
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm 5707.785 6855.137 6075.865 8868.587 5275.698 7273.963 5005.121 14438.726
DriverBenchmark.return100Records:·gc.count 80.000 77.000 80.000 58.000 87.000 77.000 80.000 61.000
DriverBenchmark.return100Records:·gc.time 275.000 282.000 281.000 408.000 255.000 299.000 284.000 409.000
Thanks to the tabular formation, it's easier to see that I'm probably wrong on the compute bound but. Notice how tracing enabled/disabled makes very little difference on 3.2.9, but it matters a lot on 3.3. That suggests the fiber dump mechanism is primarily at fault. Absolutely fascinating.
Can you open an issue on CE for this? I think this is a meaningful performance regression.
We can transfer this one actually.
https://github.com/typelevel/cats-effect/pull/2368#issuecomment-925306033
We seem to be hitting this issue particularly hard in Scala.js (dev environment), with a performance degradation by a whole order of magnitude. For example, an iteration over ~100 IO calls which took between 1 and 2 seconds in 3.2.9, now takes around 30 in 3.3.0.
So far, I've been unable to avoid the issue by setting process.env.CATS_EFFECT_TRACING_MODE="none". Even though it's set when the application starts (before invoking any Scala.js code), the performance degradation persists.
Any pointers for further exploring solutions would be welcome.
@rpiaggio thanks for reporting and sorry for your troubles. Yes, unfortunately tracing does have a far more severe impact on JS than JVM for reasons we are still trying to understand. This is why it is only enabled in development mode. Which browser are you using?
However, setting CATS_EFFECT_TRACING_MODE as you've done should for the most part disable it, I'm not sure why that's not working for you. I'm assuming this related to the explore project linked above, so I'll take a look there.
See also:
- https://github.com/typelevel/cats-effect/issues/2284
- https://github.com/typelevel/cats-effect/issues/2558
- https://github.com/typelevel/cats-effect/pull/2207#issuecomment-909306546
- https://github.com/typelevel/cats-effect/pull/2207#issuecomment-903199828
@rpiaggio one idea, to check that CATS_EFFECT_TRACING_MODE is working—can you raise an exception from inside CE and report the stacktrace? If it's enhanced then that means tracing is still enabled.
@armanbilge Thank you for the links.
Which browser are you using?
Chrome Version 96.0.4664.93 (Official Build) (x86_64)
I'll take a look there.
Thank you!
@rpiaggio one idea, to check that
CATS_EFFECT_TRACING_MODEis working—can you raise an exception from inside CE and report the stacktrace? If it's enhanced then that means tracing is still enabled.
Good idea. I tried your suggestion and I'm actually not getting enhanced stacktraces.
This is what I get when setting CATS_EFFECT_TRACING_MODE: "none":

vs what I get when not setting CATS_EFFECT_TRACING_MODE:

Aha, that is helpful. So now that we've confirmed tracing is off, you are definitely still seeing a significant performance regression? That is concerning!