Flight client is too slow in Monolith OSS & PRO
Problem Statement
On Monolith OSS, while doing Performance runs we have observed that the Comparison between CURL and Flight query performance raises the doubt whether there could be an underlying bug with the Flight implementation in Monolith as the Flight latency is way too slow.
Expected Results
Flight naturally will be slower than Curl but 2x times and in some cases 10x times are not expected.
Metrics to support the identification of the problem
Below are the metrics captured for various queries with curl and flight,
- Most of the queries irrespective of the query ranges report being about 2x times faster with curl
- last value cache shows nearly 10x better performance with curl
| Query load (concurrency = 1) | OSS - 5m query range(P95 latency in ms) | OSS - 5m query range(P95 latency in ms) | OSS - 10m query range(P95 latency in ms) | OSS - 10m query range(P95 latency in ms) | OSS - 60m query range(P95 latency in ms) | OSS - 60m query range(P95 latency in ms) |
|---|---|---|---|---|---|---|
| Client | flight | curl | flight | curl | flight | curl |
| c13 | 125.4 | 46.69 | 127.62 | 47.46 | 129.48 | 48.76 |
| c14 | 89.5 | 45.01 | 90.28 | 58.48 | 122.19 | 47.38 |
| f1 | 78.24 | 44.65 | 94.27 | 60.41 | 98.48 | 63.21 |
| lastcachevalue_1 | 89.45 | 8.62 | 90.23 | 9 | 92.23 | 10.5 |
| lastcachevalue_10 | 94.34 | 10.62 | 90.34 | 8.62 | 90.06 | 8.43 |
Reach out to @MaduMitha-Ravi for more details/metrics.
@MaduMitha-Ravi any chance the code that uses the Flight client can be shared?
@adrian-thurston Can you share please?
Hi @hiltontj, code is here:
https://github.com/influxdata/perf-tools/blob/b460bba302257635a8724c9eb60cebb6bc9d556f/ic/query/query.go#L1154
fyi - Updated the title, the slowness is observed in OSS and PRO.
@MaduMitha-Ravi - I've created an experimental branch to identify the bottle neck. You can see the actual changes in this draft PR. We have a mechanism in monolith to use the same port for both HTTP/GRPC (flightsql), but in my branch I've made them listen on separate ports (8181 for HTTP by default and 8183 for GRPC by default). I'd like to know if the difference in performance between flightsql client and curl can still be observed. Thanks
I've managed to reproduce the issue locally I think. Below is snapshot of a run, the units are in millis
2024-10-24T13:45:09.642371Z INFO server::flight: =========================================================
2024-10-24T13:45:09.642398Z INFO server::flight: HTTP STATS: min=15.0 max=38.0 avg=19.17
2024-10-24T13:45:09.642432Z INFO server::flight: GRPC STATS: min=60.0 max=104.0 avg=66.67
2024-10-24T13:45:09.642463Z INFO server::flight: =========================================================
When running the same tests against my experimental branch the latency for flightsql calls go down (comparable to http)
2024-10-24T15:10:22.216810Z INFO server::flight: =========================================================
2024-10-24T15:10:22.216816Z INFO server::flight: HTTP STATS: min=11.0 max=22.0 avg=13.66
2024-10-24T15:10:22.216824Z INFO server::flight: GRPC STATS: min=10.0 max=27.0 avg=13.77
2024-10-24T15:10:22.216831Z INFO server::flight: =========================================================
One thing to note is, the issue wasn't reproducible in debug builds, so I used cargo build --release. I haven't checked the quick-release profile although I'd expect it to behave similar to release build.
Once the build is ready, you can generate the traffic using load generator,
influxdb3_load_generator write --builtin-spec one_mil
and run the binary itself
LOG_FILTER=error INFLUXDB3_WAL_SNAPSHOT_SIZE=1 RUST_BACKTRACE=1 ./influxdb3-main serve --host-id localhost --object-store file --data-dir test-data
The next step would be to tidy up my experimental branch and get it merged. It would be good to run both HTTP/GRPC on same port, so we may revisit this option in the future. But for now, we can live with running them on separate ports.
@praveen-influx Is the above experiment with TCP_NODELAY if so we can have them on the same ports. To test things, we (perf team) just need a patch or branch of your work (we can merge after analyzing the results).
No @MaduMitha-Ravi - it is purely just separating the ports. I think this is different to iox in the sense that this issue (difference in time) can be observed always on release builds. This suggests that the compiler couldn't optimise certain path in the extra machinery we've built to run them both on same port. So the change to run them on separate ports 8181 for HTTP and 8183 for GRPC has been sufficient to bring the latency down.
Having said that I think I'll add a separate commit to add TCP_NODELAY as well.
Thanks @praveen-influx for explaining to me :)
@MaduMitha-Ravi / @adrian-thurston - thanks for TCP_NODELAY suggestion.
I managed to do a run with TCP_NODELAY enabled, turns out setting just this flag does the trick. In fact looking at the difference in code, I used tonic directly to separate out the port for grpc calls and it is not the fact that it's running on separate port but instead the tonic library defaults to setting TCP_NODELAY to true under the hood, see here.
2024-10-24T20:20:22.750460Z INFO server::flight: =========================================================
2024-10-24T20:20:22.750470Z INFO server::flight: HTTP STATS: min=11.0 max=22.0 avg=13.68
2024-10-24T20:20:22.750484Z INFO server::flight: GRPC STATS: min=10.0 max=21.0 avg=13.43
2024-10-24T20:20:22.750498Z INFO server::flight: =========================================================
So, looks like we should be ok to run both HTTP/GRPC on the same port without extra latency :tada: I'll submit the PR, I'm still puzzled a bit as to why debug build doesn't show this issue. I might look into it separately.
Results after fix
Below are the numbers after enablement of TCP_NODELAY, Flight slowness is solved and is in par with Curl (mostly after than CURL)
| Query Name | PRO - Flight | PRO - CURL | OSS - Flight | OSS - CURL |
|---|---|---|---|---|
| c13 | 9.12 | 12.77 | 11.08 | 15.07 |
| c14 | 8.42 | 12.40 | 11.41 | 15.14 |
| f1 | 9.76 | 13.92 | 11.42 | 15.56 |
Conclusion
The Flight slowness issue is now fixed.