fs2-grpc
Performance in benchmarks
Hi
I am opening this issue to document some findings about fs2-grpc performance in this benchmark. I started this journey by investigating why the akka-grpc results were so bad (https://discuss.lightbend.com/t/akka-grpc-performance-in-benchmarks/8236/), but then got curious about what the numbers would be for other implementations...
The fs2-grpc implementation of the benchmark was done in this PR, and these are the results I got:
Benchmark info:
37a7f8b Mon, 17 May 2021 16:06:05 +0100 João Ferreira scala zio-grpc implementation
Benchmarks run: scala_fs2_bench scala_akka_bench scala_zio_bench java_hotspot_grpc_pgc_bench
GRPC_BENCHMARK_DURATION=50s
GRPC_BENCHMARK_WARMUP=5s
GRPC_SERVER_CPUS=3
GRPC_SERVER_RAM=512m
GRPC_CLIENT_CONNECTIONS=50
GRPC_CLIENT_CONCURRENCY=1000
GRPC_CLIENT_QPS=0
GRPC_CLIENT_CPUS=9
GRPC_REQUEST_PAYLOAD=100B
-----
Benchmark finished. Detailed results are located in: results/211705T162018
--------------------------------------------------------------------------------------------------------------------------------
| name | req/s | avg. latency | 90 % in | 95 % in | 99 % in | avg. cpu | avg. memory |
--------------------------------------------------------------------------------------------------------------------------------
| java_hotspot_grpc_pgc | 59884 | 16.19 ms | 40.65 ms | 54.12 ms | 88.15 ms | 256.21% | 204.7 MiB |
| scala_akka | 7031 | 141.70 ms | 281.35 ms | 368.74 ms | 592.53 ms | 294.91% | 175.44 MiB |
| scala_fs2 | 7005 | 142.20 ms | 231.57 ms | 266.35 ms | 357.07 ms | 274.57% | 351.34 MiB |
| scala_zio | 6835 | 145.74 ms | 207.45 ms | 218.25 ms | 266.37 ms | 242.61% | 241.43 MiB |
--------------------------------------------------------------------------------------------------------------------------------
I did some profiling with JFR and wanted to share the results
The biggest problem is GC:

Threads look fine:

Memory:

And the culprits are scalapb.GeneratedMessageCompanion.parseFrom and fs2.grpc.server.Fs2ServerCall#sendMessage. There is also a lot of cats.effect.* stuff...
So after “wasting” all these hours profiling, I noticed that the heap settings were not being applied. After changing that, the results are a bit better.
https://discuss.lightbend.com/t/akka-grpc-performance-in-benchmarks/8236/14
I did some more profiling after fixing the heap settings, and even though the results are much better, I noticed the usage of unsafeRunSync (the pink on the left side).

I am not very experienced with cats-effect, but my understanding is that we could use the Async FFI without having to call "unsafe" code.
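To make that concrete, here is a minimal sketch of what the Async FFI does (assuming cats-effect 3; CallbackApi and sendF are made-up names, not fs2-grpc's API): a callback-based call is wrapped into an F value instead of an effect being executed via unsafeRunSync.

```scala
import cats.effect.Async

// Hypothetical stand-in for a grpc-java style callback API.
trait CallbackApi {
  def send(msg: String, onDone: Either[Throwable, Unit] => Unit): Unit
}

// Wrap the callback with the Async FFI: the callback completes the F value,
// so nothing "unsafe" is run on this side.
def sendF[F[_]](api: CallbackApi, msg: String)(implicit F: Async[F]): F[Unit] =
  F.async_ { cb =>
    api.send(msg, cb)
  }
```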
For reference, here is the flamegraph for the java benchmark:

The netty part is pretty similar (purple, right side), but compared with the picture from the post above, here we have the cats-effect threads (right side) and the ServiceBuilder Executor threads (left side).
You could try to see if it makes things faster by using the runtime's compute pool as the Executor, e.g. new Executor { def execute(cmd: Runnable): Unit = runtime.compute.execute(cmd) }. Might make a difference.
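Spelled out, that suggestion might look roughly like this (a sketch assuming a cats-effect 3 IORuntime and grpc-java's ServerBuilder#executor; withComputeExecutor is just an illustrative name):

```scala
import java.util.concurrent.Executor

import cats.effect.unsafe.IORuntime
import io.grpc.ServerBuilder

// Hand gRPC callbacks straight to the cats-effect compute pool instead of
// grpc-java's default executor, avoiding one extra thread-pool hop.
def withComputeExecutor[T <: ServerBuilder[T]](builder: T, runtime: IORuntime): T =
  builder.executor(new Executor {
    def execute(cmd: Runnable): Unit = runtime.compute.execute(cmd)
  })
```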
I tried that, and even new Executor { def execute(cmd: Runnable): Unit = IO.blocking(cmd.run()).unsafeRunSync() }. If I recall correctly, the application was being killed by the OOM killer. I even tried upgrading to the latest cats-effect in case that would make a difference, but it didn't.
I did try it and memory did not go up, and it was around 2k req/s faster than otherwise. However, I suppose there is unnecessary context shifting, but I am not sure what the best way to avoid that is.
Maybe I was doing something wrong, but I will try again later today and let you know. What benchmark settings were you using? Meanwhile, did you have a look at that unsafeRunSync to see if there are ways to avoid it?
I cannot remember what I did, but I tried again, allocating more CPUs to see what would happen:
--------------------------------------------------------------------------------------------------------------------------------
| name | req/s | avg. latency | 90 % in | 95 % in | 99 % in | avg. cpu | avg. memory |
--------------------------------------------------------------------------------------------------------------------------------
| scala_fs2 | 37711 | 26.28 ms | 47.02 ms | 72.41 ms | 148.46 ms | 1087.87% | 411.78 MiB |
--------------------------------------------------------------------------------------------------------------------------------
Benchmark Execution Parameters:
b81da51 Wed, 19 May 2021 23:36:38 +0200 GitHub Merge pull request #145 from LesnyRumcajs/harden-analysis-cleanup
- GRPC_BENCHMARK_DURATION=30s
- GRPC_BENCHMARK_WARMUP=10s
- GRPC_SERVER_CPUS=20
- GRPC_SERVER_RAM=1024m
- GRPC_CLIENT_CONNECTIONS=50
- GRPC_CLIENT_CONCURRENCY=1000
- GRPC_CLIENT_QPS=0
- GRPC_CLIENT_CPUS=9
- GRPC_REQUEST_PAYLOAD=100B
All done.
and grpc-java:
--------------------------------------------------------------------------------------------------------------------------------
| name | req/s | avg. latency | 90 % in | 95 % in | 99 % in | avg. cpu | avg. memory |
--------------------------------------------------------------------------------------------------------------------------------
| java_hotspot_grpc_pgc | 72310 | 12.76 ms | 23.63 ms | 34.47 ms | 80.52 ms | 574.66% | 396.55 MiB |
--------------------------------------------------------------------------------------------------------------------------------
Benchmark Execution Parameters:
b81da51 Wed, 19 May 2021 23:36:38 +0200 GitHub Merge pull request #145 from LesnyRumcajs/harden-analysis-cleanup
- GRPC_BENCHMARK_DURATION=30s
- GRPC_BENCHMARK_WARMUP=10s
- GRPC_SERVER_CPUS=20
- GRPC_SERVER_RAM=1024m
- GRPC_CLIENT_CONNECTIONS=50
- GRPC_CLIENT_CONCURRENCY=1000
- GRPC_CLIENT_QPS=0
- GRPC_CLIENT_CPUS=9
- GRPC_REQUEST_PAYLOAD=100B
All done.
Most likely it is context switching that is killing the performance for fs2-grpc.
Can you have a look at https://github.com/typelevel/fs2-grpc/pull/394 and see if it helps? (should slightly reduce the number of unsafeRun operations per request)
No, never mind. There wasn't actually much to improve there.
But there is another issue: https://github.com/typelevel/fs2-grpc/pull/39 -- flow control. I mentioned subtleties before, but I've completely lost context, so I'll have another look at this. Flow control is important -- right now, the "window size" for data from the client is always 1 or 0, and this could have a major impact on throughput.
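For context, the relevant grpc-java primitive is ServerCall#request(numMessages): the server must explicitly ask for client messages, so requesting only one at a time caps the inbound "window" at 1. A purely illustrative listener (not fs2-grpc's implementation; prefetch is a made-up parameter) that keeps a larger window open might look like:

```scala
import io.grpc.ServerCall

// Illustrative only: ask for `prefetch` messages up front, then top the
// window back up as each message arrives.
class PrefetchingListener[Req, Resp](call: ServerCall[Req, Resp], prefetch: Int)
    extends ServerCall.Listener[Req] {

  call.request(prefetch)

  override def onMessage(message: Req): Unit = {
    // ... hand the message off for processing ...
    call.request(1)
  }
}
```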
And that's not an issue in non-streaming scenarios, which is what the benchmark uses 😞 I'd better actually download the benchmark code…
If it is of any use, Lightbend blogged about how they increased Akka gRPC performance: https://www.lightbend.com/blog/akka-grpc-update-delivers-1200-percent-performance-improvement