
Performance in benchmarks

Open jtjeferreira opened this issue 4 years ago • 12 comments

Hi

I am opening this issue to document some findings about fs2-grpc performance in this benchmark. I started this journey investigating why the akka-grpc results were so bad (https://discuss.lightbend.com/t/akka-grpc-performance-in-benchmarks/8236/), but then got curious about what the numbers would be for other implementations...

The fs2-grpc implementation of the benchmark was added in this PR, and these are the results I got:

Benchmark info:
37a7f8b Mon, 17 May 2021 16:06:05 +0100 João Ferreira scala zio-grpc implementatio
Benchmarks run: scala_fs2_bench scala_akka_bench scala_zio_bench java_hotspot_grpc_pgc_bench
GRPC_BENCHMARK_DURATION=50s
GRPC_BENCHMARK_WARMUP=5s
GRPC_SERVER_CPUS=3
GRPC_SERVER_RAM=512m
GRPC_CLIENT_CONNECTIONS=50
GRPC_CLIENT_CONCURRENCY=1000
GRPC_CLIENT_QPS=0
GRPC_CLIENT_CPUS=9
GRPC_REQUEST_PAYLOAD=100B
-----
Benchmark finished. Detailed results are located in: results/211705T162018
--------------------------------------------------------------------------------------------------------------------------------
| name               |   req/s |   avg. latency |        90 % in |        95 % in |        99 % in | avg. cpu |   avg. memory |
--------------------------------------------------------------------------------------------------------------------------------
| java_hotspot_grpc_pgc |   59884 |       16.19 ms |       40.65 ms |       54.12 ms |       88.15 ms |  256.21% |     204.7 MiB |
| scala_akka         |    7031 |      141.70 ms |      281.35 ms |      368.74 ms |      592.53 ms |  294.91% |    175.44 MiB |
| scala_fs2          |    7005 |      142.20 ms |      231.57 ms |      266.35 ms |      357.07 ms |  274.57% |    351.34 MiB |
| scala_zio          |    6835 |      145.74 ms |      207.45 ms |      218.25 ms |      266.37 ms |  242.61% |    241.43 MiB |
--------------------------------------------------------------------------------------------------------------------------------

I did some profiling with JFR and wanted to share the results

The biggest problem is GC:

[JFR screenshot: GC]

Threads look fine: [JFR screenshot: threads]

Memory:

[JFR screenshot: memory]

And the culprits are scalapb.GeneratedMessageCompanion.parseFrom and fs2.grpc.server.Fs2ServerCall#sendMessage. There is also a lot of cats.effect.* machinery showing up in the profile...

jtjeferreira · May 18 '21 13:05

So after “wasting” all these hours profiling, I noticed that the heap settings were not being applied. After changing that, the results are a bit better.

https://discuss.lightbend.com/t/akka-grpc-performance-in-benchmarks/8236/14

jtjeferreira · May 18 '21 23:05

I was doing some more profiling after fixing the heap settings, and even though the results are much better, I noticed the usage of unsafeRunSync (the pink on the left side).

[JFR flame graph screenshot]

I am not very experienced with cats-effect, but my understanding is that we could use the Async FFI instead of calling "unsafe" code.
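For illustration, this is the kind of thing the Async FFI enables; a minimal sketch with a made-up callback API standing in for a gRPC listener, not the actual fs2-grpc internals.

```scala
import cats.effect.Async

object AsyncFfiSketch {
  // Hypothetical callback-based API, standing in for a gRPC listener.
  trait CallbackApi[A] {
    def onResult(callback: Either[Throwable, A] => Unit): Unit
  }

  // Wrap the callback in F with the Async FFI: the effect completes when the
  // callback fires, so no unsafeRunSync is needed on the calling side.
  def wrapped[F[_]: Async, A](api: CallbackApi[A]): F[A] =
    Async[F].async_(cb => api.onResult(cb))
}
```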

jtjeferreira · May 19 '21 11:05

For reference, here is the flame graph for the Java benchmark: [JFR flame graph screenshot]

The Netty part is pretty similar (purple, right side), but compared with the picture in the post above, we have the cats-effect threads (right side) and the ServerBuilder executor threads (left side).

jtjeferreira · May 19 '21 13:05

You could try to see if it makes things faster by using the runtime's compute pool as the Executor: new Executor { def execute(cmd: Runnable): Unit = runtime.compute.execute(cmd) }. Might make a difference.
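Roughly, the wiring would be something like this; a sketch assuming a cats-effect 3 IORuntime and grpc-java's ServerBuilder, with made-up helper names.

```scala
import java.util.concurrent.Executor

import cats.effect.unsafe.IORuntime
import io.grpc.ServerBuilder

object ComputeExecutorSketch {
  // Adapt the cats-effect compute pool to the java.util.concurrent.Executor
  // that grpc-java expects, so gRPC callbacks land directly on the compute pool.
  def computeExecutor(runtime: IORuntime): Executor =
    new Executor {
      def execute(cmd: Runnable): Unit = runtime.compute.execute(cmd)
    }

  // Hypothetical wiring into a grpc-java ServerBuilder.
  def configure[T <: ServerBuilder[T]](builder: T, runtime: IORuntime): T =
    builder.executor(computeExecutor(runtime))
}
```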

ahjohannessen · May 20 '21 12:05

You could try to see if it makes things faster by using the runtime's compute pool as the Executor: new Executor { def execute(cmd: Runnable): Unit = runtime.compute.execute(cmd) }. Might make a difference.

I tried that, and even new Executor { def execute(cmd: Runnable): Unit = IO.blocking(cmd.run()).unsafeRunSync() }. If I recall correctly, the application was being killed by OOM. I even tried upgrading to the latest cats-effect in case it would make a difference, but it didn't.

jtjeferreira · May 20 '21 13:05

I did try it; memory did not go up, and it was around 2k req/s faster than otherwise. However, I suppose there is unnecessary context shifting, but I am not sure what the best way to avoid that is.

ahjohannessen · May 20 '21 13:05

I did try it; memory did not go up, and it was around 2k req/s faster than otherwise. However, I suppose there is unnecessary context shifting, but I am not sure what the best way to avoid that is.

Maybe I was doing something wrong, but I will try again later today and let you know. What benchmark settings were you using? Meanwhile, did you have a look at that unsafeRunSync to see if there are ways to avoid it?

jtjeferreira · May 20 '21 14:05

I cannot remember what I did, but I tried again, allocating more CPUs to see what would happen:

--------------------------------------------------------------------------------------------------------------------------------
| name               |   req/s |   avg. latency |        90 % in |        95 % in |        99 % in | avg. cpu |   avg. memory |
--------------------------------------------------------------------------------------------------------------------------------
| scala_fs2          |   37711 |       26.28 ms |       47.02 ms |       72.41 ms |      148.46 ms | 1087.87% |    411.78 MiB |
--------------------------------------------------------------------------------------------------------------------------------
Benchmark Execution Parameters:
b81da51 Wed, 19 May 2021 23:36:38 +0200 GitHub Merge pull request #145 from LesnyRumcajs/harden-analysis-cleanup
- GRPC_BENCHMARK_DURATION=30s
- GRPC_BENCHMARK_WARMUP=10s
- GRPC_SERVER_CPUS=20
- GRPC_SERVER_RAM=1024m
- GRPC_CLIENT_CONNECTIONS=50
- GRPC_CLIENT_CONCURRENCY=1000
- GRPC_CLIENT_QPS=0
- GRPC_CLIENT_CPUS=9
- GRPC_REQUEST_PAYLOAD=100B
All done.

and grpc-java

--------------------------------------------------------------------------------------------------------------------------------
| name               |   req/s |   avg. latency |        90 % in |        95 % in |        99 % in | avg. cpu |   avg. memory |
--------------------------------------------------------------------------------------------------------------------------------
| java_hotspot_grpc_pgc |   72310 |       12.76 ms |       23.63 ms |       34.47 ms |       80.52 ms |  574.66% |    396.55 MiB |
--------------------------------------------------------------------------------------------------------------------------------
Benchmark Execution Parameters:
b81da51 Wed, 19 May 2021 23:36:38 +0200 GitHub Merge pull request #145 from LesnyRumcajs/harden-analysis-cleanup
- GRPC_BENCHMARK_DURATION=30s
- GRPC_BENCHMARK_WARMUP=10s
- GRPC_SERVER_CPUS=20
- GRPC_SERVER_RAM=1024m
- GRPC_CLIENT_CONNECTIONS=50
- GRPC_CLIENT_CONCURRENCY=1000
- GRPC_CLIENT_QPS=0
- GRPC_CLIENT_CPUS=9
- GRPC_REQUEST_PAYLOAD=100B
All done.

Most likely it is context switching that is killing the performance for fs2-grpc.

ahjohannessen · May 20 '21 15:05

Can you have a look at https://github.com/typelevel/fs2-grpc/pull/394 and see if it helps? (should slightly reduce the number of unsafeRun operations per request)

fiadliel · Jun 12 '21 19:06

No, never mind. There wasn't actually much to improve there.

But there is another issue: https://github.com/typelevel/fs2-grpc/pull/39 -- flow control. I mentioned subtleties before, but I've completely lost context. I'll start having another look at this. But flow control is important -- right now, the "window size" for data from the client is always 1 or 0, and this could have a major impact on throughput.
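To make the "window size" point concrete: grpc-java only delivers messages that have been explicitly requested, so the number of messages requested up front bounds the in-flight window. A rough sketch of a larger prefetch (the window value is made up, and this is not what fs2-grpc does today):

```scala
import io.grpc.ServerCall

object FlowControlSketch {
  // Requesting one message at a time yields the "1 or 0" window described
  // above; requesting a larger batch lets the client keep sending while
  // earlier messages are still being processed.
  def prefetch[Req, Resp](call: ServerCall[Req, Resp], window: Int): Unit =
    call.request(window)
}
```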

fiadliel · Jun 12 '21 20:06

And flow control is not an issue in non-streaming scenarios, which is what the benchmark uses 😞 I'd better actually download the benchmark code…

fiadliel · Jun 12 '21 20:06

If it is of any use, Lightbend blogged about how they increased Akka gRPC performance: https://www.lightbend.com/blog/akka-grpc-update-delivers-1200-percent-performance-improvement

sideeffffect · Jan 15 '22 21:01