
Flight client is too slow in Monolith OSS & PRO

Open MaduMitha-Ravi opened this issue 1 year ago • 4 comments

Problem Statement

During performance runs on Monolith OSS, we observed that Flight query latency is far higher than that of equivalent curl (HTTP) queries. The size of the gap raises the question of whether there is an underlying bug in the Flight implementation in Monolith.

Expected Results

Some overhead for Flight compared with curl is expected, but a 2x difference, and in some cases 10x, is not.

Metrics to support the identification of the problem

Below are the metrics captured for various queries with curl and Flight:

  • Most queries, irrespective of the query range, are roughly 2x faster with curl
  • The last value cache queries are nearly 10x faster with curl
| Query load (concurrency = 1) | OSS, 5m range, Flight (P95 ms) | OSS, 5m range, curl (P95 ms) | OSS, 10m range, Flight (P95 ms) | OSS, 10m range, curl (P95 ms) | OSS, 60m range, Flight (P95 ms) | OSS, 60m range, curl (P95 ms) |
|---|---|---|---|---|---|---|
| c13 | 125.4 | 46.69 | 127.62 | 47.46 | 129.48 | 48.76 |
| c14 | 89.5 | 45.01 | 90.28 | 58.48 | 122.19 | 47.38 |
| f1 | 78.24 | 44.65 | 94.27 | 60.41 | 98.48 | 63.21 |
| lastcachevalue_1 | 89.45 | 8.62 | 90.23 | 9 | 92.23 | 10.5 |
| lastcachevalue_10 | 94.34 | 10.62 | 90.34 | 8.62 | 90.06 | 8.43 |

Reach out to @MaduMitha-Ravi for more details/metrics.
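
For context, P95 above is the 95th-percentile latency over the sampled requests. A minimal nearest-rank sketch of that calculation (illustrative only, not the perf tooling used for these runs):

```rust
/// Illustrative only: nearest-rank 95th percentile over latency samples in ms.
fn p95(samples: &mut [f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // Nearest-rank method: ceil(0.95 * n), used as a 1-based rank.
    let rank = ((0.95 * samples.len() as f64).ceil() as usize).max(1);
    Some(samples[rank - 1])
}

fn main() {
    let mut flight_ms = vec![125.4, 89.5, 78.24, 89.45, 94.34];
    println!("P95 = {:?} ms", p95(&mut flight_ms));
}
```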

MaduMitha-Ravi avatar Oct 15 '24 13:10 MaduMitha-Ravi

@MaduMitha-Ravi any chance the code that uses the Flight client can be shared?

hiltontj avatar Oct 15 '24 18:10 hiltontj

@adrian-thurston Can you share please?

MaduMitha-Ravi avatar Oct 15 '24 18:10 MaduMitha-Ravi

Hi @hiltontj, code is here:

https://github.com/influxdata/perf-tools/blob/b460bba302257635a8724c9eb60cebb6bc9d556f/ic/query/query.go#L1154

adrian-thurston avatar Oct 15 '24 22:10 adrian-thurston

FYI - updated the title; the slowness is observed in both OSS and PRO.

MaduMitha-Ravi avatar Oct 18 '24 14:10 MaduMitha-Ravi

@MaduMitha-Ravi - I've created an experimental branch to identify the bottleneck. You can see the actual changes in this draft PR. We have a mechanism in monolith to use the same port for both HTTP/GRPC (flightsql), but in my branch I've made them listen on separate ports (8181 for HTTP by default and 8183 for GRPC by default). I'd like to know if the difference in performance between the flightsql client and curl can still be observed. Thanks
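
For reference, the shape of that change looks roughly like the following hedged sketch (not the actual diff in the draft PR), using tonic directly with the standard gRPC health service standing in for monolith's Flight service:

```rust
// Hedged sketch only: serve gRPC on its own port (8183 by default) instead of
// sharing the HTTP listener on 8181. The health service is just a placeholder
// for the real Flight service implementation.
use tonic::transport::Server;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let (_reporter, health_svc) = tonic_health::server::health_reporter();

    Server::builder()
        .add_service(health_svc)
        .serve("127.0.0.1:8183".parse()?)
        .await?;

    Ok(())
}
```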

praveen-influx avatar Oct 22 '24 12:10 praveen-influx

I think I've managed to reproduce the issue locally. Below is a snapshot of a run; the units are milliseconds.

2024-10-24T13:45:09.642371Z  INFO server::flight: =========================================================
2024-10-24T13:45:09.642398Z  INFO server::flight: HTTP STATS:  min=15.0 max=38.0 avg=19.17
2024-10-24T13:45:09.642432Z  INFO server::flight: GRPC STATS:  min=60.0 max=104.0 avg=66.67
2024-10-24T13:45:09.642463Z  INFO server::flight: =========================================================

When running the same tests against my experimental branch, the latency for flightsql calls goes down and becomes comparable to HTTP:

2024-10-24T15:10:22.216810Z  INFO server::flight: =========================================================
2024-10-24T15:10:22.216816Z  INFO server::flight: HTTP STATS:  min=11.0 max=22.0 avg=13.66
2024-10-24T15:10:22.216824Z  INFO server::flight: GRPC STATS:  min=10.0 max=27.0 avg=13.77
2024-10-24T15:10:22.216831Z  INFO server::flight: =========================================================
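
The STATS lines are just min/max/avg over per-request timings, roughly like this (an illustrative sketch, not the actual test code):

```rust
// Illustrative sketch: time a request closure N times and report min/max/avg
// in milliseconds, in the same shape as the HTTP STATS / GRPC STATS lines.
use std::time::Instant;

fn run_timed<F: FnMut()>(label: &str, iterations: usize, mut request: F) {
    let mut samples_ms = Vec::with_capacity(iterations);
    for _ in 0..iterations {
        let start = Instant::now();
        request(); // issue one HTTP or Flight/gRPC query here
        samples_ms.push(start.elapsed().as_secs_f64() * 1000.0);
    }
    let min = samples_ms.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = samples_ms.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let avg = samples_ms.iter().sum::<f64>() / samples_ms.len() as f64;
    println!("{label} STATS:  min={min:.1} max={max:.1} avg={avg:.2}");
}

fn main() {
    run_timed("HTTP", 100, || { /* send the HTTP query */ });
    run_timed("GRPC", 100, || { /* send the Flight query */ });
}
```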

One thing to note: the issue wasn't reproducible in debug builds, so I used cargo build --release. I haven't checked the quick-release profile, although I'd expect it to behave similarly to the release build.

Once the build is ready, you can generate traffic using the load generator:

```
influxdb3_load_generator write --builtin-spec one_mil
```

and run the binary itself:

```
LOG_FILTER=error INFLUXDB3_WAL_SNAPSHOT_SIZE=1 RUST_BACKTRACE=1 ./influxdb3-main serve --host-id localhost --object-store file --data-dir test-data
```

then run this test in the branch.

The next step is to tidy up my experimental branch and get it merged. It would be good to run both HTTP and GRPC on the same port, so we may revisit this option in the future; for now, we can live with running them on separate ports.

praveen-influx avatar Oct 24 '24 16:10 praveen-influx

@praveen-influx Is the above experiment with TCP_NODELAY? If so, we can keep them on the same port. To test things, we (the perf team) just need a patch or branch of your work (we can merge it after analyzing the results).

MaduMitha-Ravi avatar Oct 24 '24 17:10 MaduMitha-Ravi

No @MaduMitha-Ravi - it is purely just separating the ports. I think this is different from iox in the sense that this issue (the difference in time) can always be observed on release builds. This suggests that the compiler couldn't optimise a certain path in the extra machinery we've built to run them both on the same port. So the change to run them on separate ports (8181 for HTTP and 8183 for GRPC) has been sufficient to bring the latency down.

Having said that, I think I'll add a separate commit to set TCP_NODELAY as well.

praveen-influx avatar Oct 24 '24 17:10 praveen-influx

Thanks @praveen-influx for explaining to me :)

MaduMitha-Ravi avatar Oct 24 '24 17:10 MaduMitha-Ravi

@MaduMitha-Ravi / @adrian-thurston - thanks for the TCP_NODELAY suggestion.

I managed to do a run with TCP_NODELAY enabled, and it turns out that setting just this flag does the trick. In fact, looking at the difference in code: I used tonic directly to serve the grpc calls on a separate port, and it is not the separate port that makes the difference; rather, the tonic library defaults to setting TCP_NODELAY to true under the hood, see here.

2024-10-24T20:20:22.750460Z  INFO server::flight: =========================================================
2024-10-24T20:20:22.750470Z  INFO server::flight: HTTP STATS:  min=11.0 max=22.0 avg=13.68
2024-10-24T20:20:22.750484Z  INFO server::flight: GRPC STATS:  min=10.0 max=21.0 avg=13.43
2024-10-24T20:20:22.750498Z  INFO server::flight: =========================================================

So it looks like we should be OK to run both HTTP and GRPC on the same port without extra latency :tada: I'll submit the PR. I'm still a bit puzzled as to why the debug build doesn't show this issue; I might look into it separately.
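
To make the mechanism concrete, here is a hedged sketch (not monolith's actual code) of the kind of hand-rolled accept loop you need when multiplexing HTTP and gRPC on one port, where TCP_NODELAY has to be set explicitly on each accepted socket, which tonic's own listener already does for you:

```rust
// Hedged sketch: a hand-rolled accept loop for a shared HTTP/gRPC port.
// Without set_nodelay(true), Nagle's algorithm can hold back small response
// frames and add noticeable latency to gRPC/Flight calls.
use tokio::net::TcpListener;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:8181").await?;
    loop {
        let (stream, peer) = listener.accept().await?;
        // Disable Nagle's algorithm on the accepted socket, matching what
        // tonic does by default on connections it accepts itself.
        stream.set_nodelay(true)?;
        println!("accepted connection from {peer}");
        // ... hand `stream` off to the HTTP/gRPC multiplexer here ...
    }
}
```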

praveen-influx avatar Oct 24 '24 20:10 praveen-influx

Results after fix

Below are the numbers after enabling TCP_NODELAY. The Flight slowness is resolved, and Flight is now on par with curl (in most cases faster than curl).

| Query Name | PRO - Flight | PRO - CURL | OSS - Flight | OSS - CURL |
|---|---|---|---|---|
| c13 | 9.12 | 12.77 | 11.08 | 15.07 |
| c14 | 8.42 | 12.40 | 11.41 | 15.14 |
| f1 | 9.76 | 13.92 | 11.42 | 15.56 |

Conclusion

The Flight slowness issue is now fixed.

MaduMitha-Ravi avatar Oct 28 '24 13:10 MaduMitha-Ravi