thanos
Shared caching layer for thanos queriers
Is your proposal related to a problem?
We run about 10 Thanos Querier replicas for scaling purposes, and we have 100+ edge clusters (sidecar + Prometheus) across the world.
For our setup, the fanout problem is huge because of the scale. For example:
Info requests to sidecars
This is not a big problem because Info requests and responses are relatively cheap. In our setup, (number of queriers x number of sidecars) requests are sent every time, which is roughly 10 x 100+, so over 1,000 requests for us. That is fine at small scale, but it becomes inefficient as the number of Thanos Queriers and edge sidecars grows.
Metadata and rules query requests to sidecars
Metric metadata and rules rarely change for us, metric metadata especially. This is where caching would benefit us a lot.
More use cases in the future
In https://github.com/thanos-io/thanos/issues/1611, we proposed a Bloom-filter-like data structure for reducing unnecessary Series calls. Ideally, this would be done by reporting more data through the Info API and keeping a Bloom filter in the queriers. With a caching layer shared by the querier cluster, keeping the Bloom filter up to date would no longer be that expensive.
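To make the Bloom filter idea concrete, here is a minimal sketch in Go. It is not taken from Thanos or from issue 1611: the `bloom` type, its sizing, and the choice of metric names as the filtered items are all assumptions for illustration. The point is only that a querier could skip a Series call to a sidecar whose filter says "definitely not present".

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a fixed-size Bloom filter over strings (here: metric names).
// Hypothetical type, not part of Thanos.
type bloom struct {
	bits []uint64
	m    uint64 // number of bits in the filter
	k    uint64 // number of probes per item
}

func newBloom(m, k uint64) *bloom {
	return &bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// hashes splits one 64-bit FNV-1a digest into two base hashes.
// Probe i then uses h1 + i*h2 (double hashing); h2 is forced odd so it is never zero.
func hashes(s string) (uint64, uint64) {
	h := fnv.New64a()
	h.Write([]byte(s))
	sum := h.Sum64()
	return sum & 0xffffffff, (sum >> 32) | 1
}

// Add sets the k bits that correspond to s.
func (b *bloom) Add(s string) {
	h1, h2 := hashes(s)
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % b.m
		b.bits[pos/64] |= uint64(1) << (pos % 64)
	}
}

// MayContain returns false only if s was definitely never added.
// A true result can be a false positive.
func (b *bloom) MayContain(s string) bool {
	h1, h2 := hashes(s)
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % b.m
		if b.bits[pos/64]&(uint64(1)<<(pos%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	// One filter per sidecar, filled from whatever the sidecar reports.
	f := newBloom(1<<16, 4)
	f.Add("node_cpu_seconds_total")

	// Added items always match, so this sidecar must be queried.
	fmt.Println(f.MayContain("node_cpu_seconds_total"))
	// Items that were never added usually do not match, so the call can be skipped.
	fmt.Println(f.MayContain("http_requests_total"))
}
```

A filter like this never produces false negatives, so skipping a sidecar on a `false` result is safe as long as the filter is not stale. That staleness is exactly where a shared cache would help.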
Describe the solution you'd like
Add another type of cache for this use case, perhaps called a proxy cache. It is similar to the caching bucket, but it caches endpoint responses instead.
I also think the new galaxycache is very suitable for this use case.
Describe alternatives you've considered
Have some kind of gRPC proxy that caches or passes through calls depending on the request. I haven't investigated this yet, but something may already exist that suits my use case.
So, something like galaxycache but for gRPC calls? Did I understand you correctly?
Yes
I agree, this would be great. Perhaps this could be an LFX project? In the meantime I have been using a local version of this functionality: https://github.com/thanos-io/thanos/commit/310df0c5c982551d3412ce578cab52cf1e120ca6. It has already deduplicated thousands of Series() calls on my deployment. Perhaps we could merge this local version first and then work on the groupcache-esque one?
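For readers unfamiliar with the idea, local deduplication of identical in-flight calls can be sketched as below. This is not the code from the linked commit, which I have not reproduced here. The `dedup` type, the string key, and the string result are assumptions for illustration, and the pattern is the same one the `singleflight` package implements.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// call tracks one in-flight request and its eventual result.
type call struct {
	wg  sync.WaitGroup
	val string
	err error
}

// dedup collapses concurrent calls with the same key into one upstream call.
// Hypothetical type, not a Thanos API.
type dedup struct {
	mu       sync.Mutex
	inflight map[string]*call
}

func newDedup() *dedup {
	return &dedup{inflight: map[string]*call{}}
}

// Do runs fn once per key at a time. Callers that arrive while a call for
// the same key is in flight wait for it and share its result.
// Results are not kept after the call completes.
func (d *dedup) Do(key string, fn func() (string, error)) (string, error) {
	d.mu.Lock()
	if c, ok := d.inflight[key]; ok {
		d.mu.Unlock()
		c.wg.Wait() // another caller is already fetching this key
		return c.val, c.err
	}
	c := &call{}
	c.wg.Add(1)
	d.inflight[key] = c
	d.mu.Unlock()

	c.val, c.err = fn()
	c.wg.Done()

	d.mu.Lock()
	delete(d.inflight, key)
	d.mu.Unlock()
	return c.val, c.err
}

func main() {
	d := newDedup()
	var upstream int32
	var wg sync.WaitGroup

	// Five concurrent callers ask for the same series selector.
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			d.Do(`{job="node"}`, func() (string, error) {
				atomic.AddInt32(&upstream, 1)
				time.Sleep(50 * time.Millisecond) // simulate a slow sidecar
				return "series data", nil
			})
		}()
	}
	wg.Wait()

	// Typically far fewer than 5, depending on goroutine scheduling.
	fmt.Println("upstream calls:", atomic.LoadInt32(&upstream))
}
```

This only helps within a single querier process, which is why a shared, groupcache-style layer would be the natural next step.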
Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.