thanos
thanos copied to clipboard
proxy: parallel merge and dedup
Is your proposal related to a problem?
Sorting series in the querier seems to be done using a single heap-like structure, which means we can only utilize one core for deduplication.
Describe the solution you'd like
It would be nice to be able to use multiple cores when fetching and deduplicating data in order to speed up query execution. Since the engine does not require series to be sorted, we could dedup using multiple heaps in parallel and iterate through all heaps when expanding series.
The only thing to be careful about is that series which are replicas of each other must end up on the same heap. This can be done by:
- first removing replica labels from received series
- hashmoding the rest of the labels
- putting the series on the heap that corresponds to the value from the hashmod operation
Describe alternatives you've considered
All Thanos stores can now shard series dynamically by
or without
certain labels. We could therefore make multiple requests to stores, each for one specific shard, where a shard is calculated by hashmoding series without replica labels used in the query.
However, this will put more load on stores, and is not a good strategy when querying Prometheus sidecars.
@GiedriusS let me know if I am missing something obvious here that could be a blocker :)
Maybe you could add a small diagram showing how all of this could work? Hard to understand now just from text :smile:
Hm hard to explain with a diagram, but let me try with code links.
From what I see, we start a gorouting for each store that we query, and that goroutine returns a single respSet.
Could we instead return N respSet
s, where each respSet
contains disjoin series. The way we decide which series ends up in which respSet
is by hashmoding series without their replicalabels, so that replicated series always end up in the same respSet
. Now we can take these N respSet
s and build N heaps in parallel.
Hashmoding could happen here when we push a received series into the buffer.
When expanding series in the engine, they don't need to be sorted. So we can simply iterate through each series in each heap in order through some compound iterator. Does that help?
Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind
command if you wish to be reminded at some point in future.
I'll give this a shot in a few weeks.