
Tail latency on `master` is elevated

kvcache opened this issue on Jan 19, 2024 · 2 comments

Bug Report

Version

master@177c1f3d7407c60e5fb782a23fee935e7e3fa12d

  • tonic
  • tonic-build
  • tonic-reflection
  • tonic-web

Platform

Building for `aarch64-unknown-linux-gnu` on `Linux 5.10.184-175.749.amzn2.aarch64 #1 SMP Wed Jul 12 18:40:25 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux`, running as a simple systemd service (no container).

Description

I tried to update my service to tonic master to pick up some of the new features before the next release. It failed our performance tests, where we expected it to perform about the same as v0.10.2. Reverting to v0.10.2 restored the previous performance.

We monitor p99.99 (BLUE/top line), p99.9 (PINK/second line), p99 (YELLOW/third line, sometimes visible), p90, and a few more percentiles in a daily performance test to catch regressions before shipping them to users. The test measures the round-trip time to execute a small unary rpc and then receive a message from that rpc on a separate server-streaming channel. The server was running v0.10.2 and the client was running master.

client (@master)               server (@v0.10.2)
   ---     subscribe         ---> (long-lived server-streaming rpc)

   ---     unary timestamp1  ---> (timestamp sent from client)
   <--         ok            ---  (server says ok)

   <--- streaming timestamp1 ---  (server sends the timestamp from the unary call over to the streaming subscriber)
record now() - timestamp1
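
For concreteness, here is a minimal sketch of what the client side of that loop might look like with tonic-generated code. The proto module, service, message, and field names (`bench`, `BenchClient`, `Subscribe`, `SendTimestamp`, `nanos`) are assumptions for illustration, not the reporter's actual benchmark harness; it assumes the proto is compiled with tonic-build and runs on tokio.

```rust
// Hypothetical generated code from a `bench.proto` compiled via tonic-build.
pub mod bench {
    tonic::include_proto!("bench");
}

use std::time::{SystemTime, UNIX_EPOCH};

use bench::{bench_client::BenchClient, SubscribeRequest, TimestampRequest};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut client = BenchClient::connect("http://server:50051").await?;

    // Long-lived server-streaming rpc; the server echoes every timestamp it
    // receives via the unary rpc back over this stream.
    let mut events = client
        .subscribe(tonic::Request::new(SubscribeRequest {}))
        .await?
        .into_inner();

    for _ in 0..10_000 {
        // Unary rpc carrying the current wall-clock time in nanoseconds.
        let sent = SystemTime::now().duration_since(UNIX_EPOCH)?.as_nanos() as u64;
        client
            .send_timestamp(tonic::Request::new(TimestampRequest { nanos: sent }))
            .await?;

        // Wait for the echoed timestamp on the streaming channel and record
        // the round trip; a real harness would feed this into a percentile
        // histogram (p90 / p99 / p99.9 / p99.99) rather than printing it.
        if let Some(event) = events.message().await? {
            let now = SystemTime::now().duration_since(UNIX_EPOCH)?.as_nanos() as u64;
            println!("rtt: {} ns", now.saturating_sub(event.nanos));
        }
    }

    Ok(())
}
```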

This is about what we expected to see on master; it is what we saw after reverting the client to v0.10.2:

[image: daily latency percentiles with the client on v0.10.2]

This is what we observed with tonic on master:

[image: daily latency percentiles with the client on master]

Below p99 the latencies are essentially interchangeable, but at p99 and above they are clearly regressed. Unfortunately, [1] I do not currently have a minimal reproducer to offer, [2] I do not know which commit caused this (vectored I/O, maybe?), and [3] I do not know whether it reproduces outside of servers on AWS. I'm hoping to raise awareness of a possible performance regression before the next release so I can upgrade when it ships :-).

kvcache · Jan 19 '24

It would be helpful if you could run a `git bisect` to see which commit introduced this, if it's reasonably reproducible.

LucioFranco · Jan 25 '24