Querier: Deterministic buffered queries
I plan to write a formal proposal + PoC, but I'm starting this issue to cover the idea for solving a problem on the Querier fanout side that creates confusion, annoyance and overall extra latency.
I explained this problem to the team and during the latest Community Hours, but here is the initial info:
Problem
The current Querier algorithm looks as follows:

While it's not easy to spot, what happens in practice is that MergeSort chooses which StoreAPI to fetch the next message from, one by one. On average this is round robin, but it depends on the order of series. In practice this looks as follows:
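To make the coupling concrete, here is a minimal sketch (simplified types and names, not the actual Thanos code) of a streaming merge over per-store channels. The merge pulls exactly one head series per store and emits the smallest; while it does that, every store whose series was not chosen stays blocked on its unbuffered channel send, just like the starved goroutines described above.

```go
package main

import "fmt"

// series is a simplified stand-in for a StoreAPI Series message.
type series struct {
	labels string // sorted key used for merging
	store  string
}

// streamingMerge pulls from k store channels one series at a time,
// always taking the smallest label set next. The recv below is where
// a slow store (or a busy merge loop) stalls everyone else.
func streamingMerge(stores []<-chan series) []series {
	heads := make([]*series, len(stores))
	var out []series
	for {
		// Refill missing heads: a blocking receive per store.
		for i, ch := range stores {
			if heads[i] == nil {
				if s, ok := <-ch; ok {
					heads[i] = &s
				}
			}
		}
		// Pick the smallest head (the "one-by-one" choice).
		min := -1
		for i, h := range heads {
			if h != nil && (min < 0 || h.labels < heads[min].labels) {
				min = i
			}
		}
		if min < 0 {
			return out // all inputs exhausted
		}
		out = append(out, *heads[min])
		heads[min] = nil
	}
}

func main() {
	mk := func(name string, labels ...string) <-chan series {
		ch := make(chan series) // unbuffered, like a blocking gRPC Recv
		go func() {
			for _, l := range labels {
				ch <- series{labels: l, store: name}
			}
			close(ch)
		}()
		return ch
	}
	merged := streamingMerge([]<-chan series{
		mk("sidecar", "a", "c"),
		mk("store-gw", "b", "d"),
	})
	for _, s := range merged {
		fmt.Println(s.labels, s.store)
	}
}
```

Note that each sender goroutine here can only make progress when the merge happens to consume its series; that is the additive-wait behavior the traces below show.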
If we run the query (start_timestamp_ms:1626134100000 end_timestamp_ms:1627344000000 matchers:<type:RE name:"__name__" value:"continuous_app_metric9.*" > ) for 2w against 5 different StoreAPIs (I literally used interactive_test), it takes 6+ seconds on my machine:

This test has Jaeger, so I added finer-granularity spans that tell us the exact wait times (https://github.com/bwplotka/thanos/commit/e75ce708234384fbf7dfcab13294aaf17aaa18fd):
- `querier.streamSeriesSet.waitForStoreResult` - fetching a single series message from the store
- `querier.streamSeriesSet.waitForMergeSort` - the merge sort actually receiving it from this goroutine
What we see for this query is that those goroutines are starved most of the time. E.g. this sidecar fetch waited 500ms just to be able to pass its data to the merge sort:

This means the sidecar's gRPC service itself was waiting to send its messages too.
This happens for all stores. Essentially, the computation plus the slowest Stores produce latency that is additive across the fanout.
And ONLY when we have merged everything can PromQL start doing its job, due to https://github.com/prometheus/prometheus/issues/7326
Consequences:
- Latency is more or less cumulative despite fetching concurrently. We could use more concurrency here, since we buffer all messages anyway.
- We CANNOT tell which StoreAPI (or its network) was the slowest, because of this dependency on the merge sort. As you can see, the trace above shows 1.7s for this Store, but it literally waited 500+ms for the Querier to take its data.
Note that we see many of those "wait" times being significant across many message writes:

Solution
The solution is kind of... simple. We buffer everything first, then merge on top of the buffers, so PromQL evaluation can start as soon as possible. We also have to fetch as fast as possible, so that StoreAPIs are free to perform further fetches and free memory (the sooner we finish the job, the more jobs we can do!). The following diagram shows the general idea:
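A toy sketch of the buffer-then-merge idea (hypothetical names, and a plain sort standing in for a real k-way merge): each store stream is drained eagerly by its own goroutine, so no store ever blocks on a slow sibling, and the merge runs only over already-buffered data.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// series is a simplified stand-in for a StoreAPI Series message.
type series struct {
	labels string
	store  string
}

// drainAll eagerly consumes every store stream concurrently, so each
// StoreAPI can send at full speed and free its resources, instead of
// waiting for the merge to pull message by message.
func drainAll(stores []<-chan series) [][]series {
	bufs := make([][]series, len(stores))
	var wg sync.WaitGroup
	for i, ch := range stores {
		wg.Add(1)
		go func(i int, ch <-chan series) {
			defer wg.Done()
			for s := range ch {
				bufs[i] = append(bufs[i], s)
			}
		}(i, ch)
	}
	wg.Wait()
	return bufs
}

// mergeBuffered merges the already-buffered per-store results.
// Only after this can PromQL evaluation start, but by then every
// store has long since finished sending.
func mergeBuffered(bufs [][]series) []series {
	var out []series
	for _, b := range bufs {
		out = append(out, b...)
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].labels < out[j].labels })
	return out
}

func main() {
	mk := func(labels ...string) <-chan series {
		ch := make(chan series, len(labels))
		for _, l := range labels {
			ch <- series{labels: l, store: "demo"}
		}
		close(ch)
		return ch
	}
	out := mergeBuffered(drainAll([]<-chan series{mk("a", "c"), mk("b", "d")}))
	fmt.Println(len(out)) // all series buffered, then merged in label order
}
```

The trade-off, of course, is peak memory on the Querier, which is what the mmap discussion below is about.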

Looks good, I didn't think about this :+1: what do you mean by mmap on entry??
Good point - so we can buffer everything, as we do right now, but to avoid keeping all series in memory on the Querier, we could put them on disk (tmp) and mmap them. For large queries this is not too slow - it reduces the peak memory while allowing consistent throughput.
This would only be needed if we wanted to allow fanning out to infinite-cardinality queries.
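A minimal sketch of the spill-to-disk-and-mmap idea, assuming a Unix system (the function name and raw-bytes encoding are illustrative only; real code would use a proper serialization format and more careful cleanup):

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// spillAndMap writes a serialized series buffer to a temp file and
// maps it back read-only, so the Querier holds page-cache-backed
// memory instead of heap, reducing peak memory for large queries.
func spillAndMap(data []byte) ([]byte, func(), error) {
	f, err := os.CreateTemp("", "querier-spill-*")
	if err != nil {
		return nil, nil, err
	}
	// Unlink immediately; on Unix the mapping keeps the pages alive
	// and the space is reclaimed once we Munmap.
	defer os.Remove(f.Name())
	if _, err := f.Write(data); err != nil {
		f.Close()
		return nil, nil, err
	}
	m, err := syscall.Mmap(int(f.Fd()), 0, len(data),
		syscall.PROT_READ, syscall.MAP_SHARED)
	f.Close() // the mapping survives closing the fd
	if err != nil {
		return nil, nil, err
	}
	cleanup := func() { syscall.Munmap(m) }
	return m, cleanup, nil
}

func main() {
	buf, done, err := spillAndMap([]byte("series payload"))
	if err != nil {
		panic(err)
	}
	defer done()
	fmt.Println(string(buf)) // reads straight from the mapped file
}
```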
Looks good. Really nice diagrams and easy-to-understand tracing graphs as well.
I am sure we have the same StoreAPI starving problem in our environment, too. We have > 60 StoreAPIs and I can always see sidecars waiting for a long time when I check Jaeger. At first I thought this was due to network latency or the querier select concurrency, but it looks like this is the actual cause 👍
Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.
Yeah, this is still needed. If no one picks this up soon, I might try implementing something.
Taking a fresh look (I might use this case in my Efficient Go book) - let's see how far I get; there might be an opportunity to work together on this.
First, we need a solid e2e benchmark for this.
:+1: That would be great, we would be able to compare then. I'm working on implementing a tournament tree in Go. We could use a min-heap, but it's not as efficient. I couldn't find any implementation of a loser tree in Go :s
So essentially a k-way merge solution?
Yeah. Planning to open a PR next week; almost done. I decided to go with a heap-based solution for now, since it is easier to maintain, and I doubt a selection/tournament tree would bring much improvement because it needs auxiliary nodes. We actually already have Series() benchmarks - the improvement looks sweet.
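For reference, a heap-based k-way merge in Go typically looks something like the sketch below (merging plain strings for brevity; the actual PR merges label sets). This is the `container/heap` approach mentioned above, giving O(n log k) comparisons for k inputs:

```go
package main

import (
	"container/heap"
	"fmt"
)

// item is one head element in the k-way merge, remembering which
// input list it came from so we can advance that list afterwards.
type item struct {
	val  string
	list int
	idx  int
}

type minHeap []item

func (h minHeap) Len() int            { return len(h) }
func (h minHeap) Less(i, j int) bool  { return h[i].val < h[j].val }
func (h minHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x interface{}) { *h = append(*h, x.(item)) }
func (h *minHeap) Pop() interface{} {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

// kWayMerge merges k sorted string slices: seed the heap with each
// list's head, then repeatedly pop the smallest element and push
// that list's next one.
func kWayMerge(lists [][]string) []string {
	h := &minHeap{}
	for i, l := range lists {
		if len(l) > 0 {
			heap.Push(h, item{val: l[0], list: i, idx: 0})
		}
	}
	var out []string
	for h.Len() > 0 {
		top := heap.Pop(h).(item)
		out = append(out, top.val)
		if next := top.idx + 1; next < len(lists[top.list]) {
			heap.Push(h, item{val: lists[top.list][next], list: top.list, idx: next})
		}
	}
	return out
}

func main() {
	fmt.Println(kWayMerge([][]string{
		{"a", "d"}, {"b"}, {"c", "e"},
	}))
	// → [a b c d e]
}
```

A loser/tournament tree avoids the heap's sift-down on every pop, which is why it can win on comparison count, at the cost of the auxiliary nodes mentioned above.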
PTAL fix: https://github.com/thanos-io/thanos/pull/5296
Done with k-way I believe.