thanos Shared caching layer for thanos queriers

Is your proposal related to a problem?

We run about 10 Thanos querier replicas for scaling purposes, and we have 100+ edge clusters (Prometheus + sidecar) across the world.

At this scale, query fanout becomes a serious problem for our setup. For example:

Info requests to sidecars

On its own this is not a big problem, because Info requests and responses are relatively cheap. However, in our setup (number of queriers x number of sidecars) requests are sent on every round of Info requests. That is fine at small scale, but it becomes inefficient as the number of Thanos queriers and edge sidecars grows.
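To put numbers on the fanout, here is a minimal sketch of the arithmetic. The function names are made up for illustration, and the "shared" case assumes an idealized shared cache (which does not exist in Thanos today) that lets the querier cluster fetch each sidecar's Info response only once per round.

```go
package main

// directInfoRequests returns the number of Info requests per round when every
// querier asks every sidecar directly.
func directInfoRequests(queriers, sidecars int) int {
	return queriers * sidecars
}

// sharedInfoRequests returns the number of Info requests per round under the
// ASSUMPTION of an idealized shared cache: each sidecar is asked once and the
// response is shared by all queriers.
func sharedInfoRequests(sidecars int) int {
	return sidecars
}
```

With the numbers from this issue, `directInfoRequests(10, 100)` is 1000 requests per round, while `sharedInfoRequests(100)` is 100.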

Metadata and rules query requests to sidecars

Metrics metadata and rules rarely change for us, metrics metadata especially. This is where caching would benefit us a lot.
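Because these responses rarely change, even a simple time-based cache would absorb most of the repeated requests. The following is a rough sketch of that idea, not Thanos code: `TTLCache`, `GetOrFetch`, and the key format are hypothetical names, and the cached value is treated as opaque bytes.

```go
package main

import (
	"sync"
	"time"
)

// entry is one cached response plus the time after which it is stale.
type entry struct {
	value   []byte
	expires time.Time
}

// TTLCache is a hypothetical in-memory cache for rarely changing responses
// such as metrics metadata or rules.
type TTLCache struct {
	mu    sync.Mutex
	ttl   time.Duration
	items map[string]entry
}

// NewTTLCache creates a cache whose entries stay fresh for ttl.
func NewTTLCache(ttl time.Duration) *TTLCache {
	return &TTLCache{ttl: ttl, items: make(map[string]entry)}
}

// GetOrFetch returns the cached value for key if it is still fresh.
// Otherwise it calls fetch, stores the result, and returns it.
// Concurrent misses for the same key may each call fetch; this sketch
// does not coalesce them.
func (c *TTLCache) GetOrFetch(key string, fetch func() ([]byte, error)) ([]byte, error) {
	c.mu.Lock()
	e, ok := c.items[key]
	c.mu.Unlock()
	if ok && time.Now().Before(e.expires) {
		return e.value, nil
	}

	v, err := fetch()
	if err != nil {
		return nil, err // errors are not cached
	}

	c.mu.Lock()
	c.items[key] = entry{value: v, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return v, nil
}
```

A querier could wrap its metadata call in `GetOrFetch` with a key such as the sidecar address plus the request type. A cache like this is local to one querier; sharing it across replicas is the part this proposal asks for.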

More use cases in the future

In https://github.com/thanos-io/thanos/issues/1611, we proposed a Bloom-filter-like data structure for reducing unnecessary Series calls. Ideally, this would be done by having the Info API report more data and keeping a Bloom filter in the queriers. With a caching layer shared by the querier cluster, keeping the Bloom filter up to date would no longer be that expensive.
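For readers unfamiliar with the data structure, a minimal Bloom filter can be sketched as below. This is a generic illustration, not the design from issue 1611: the type `Bloom`, its sizing parameters, and the choice of FNV hashes with double hashing are all assumptions. The useful property is that `MayContain` never returns false for an item that was added, so a querier could skip a Series call whenever the filter says "no".

```go
package main

import "hash/fnv"

// Bloom is a minimal Bloom filter over strings (for example metric names).
type Bloom struct {
	bits []uint64
	m    uint64 // number of bits in the filter
	k    uint64 // number of probes per item
}

// NewBloom creates a filter with m bits and k probes. Both must be > 0.
func NewBloom(m, k uint64) *Bloom {
	return &Bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// hashes returns two base hashes used for double hashing.
func hashes(s string) (uint64, uint64) {
	a := fnv.New64a()
	a.Write([]byte(s))
	b := fnv.New64()
	b.Write([]byte(s))
	// Force the second hash to be odd so the probe step is never zero.
	return a.Sum64(), b.Sum64() | 1
}

// Add records s in the filter.
func (f *Bloom) Add(s string) {
	h1, h2 := hashes(s)
	for i := uint64(0); i < f.k; i++ {
		pos := (h1 + i*h2) % f.m
		f.bits[pos/64] |= uint64(1) << (pos % 64)
	}
}

// MayContain reports whether s might have been added. False means
// "definitely not added"; true may be a false positive.
func (f *Bloom) MayContain(s string) bool {
	h1, h2 := hashes(s)
	for i := uint64(0); i < f.k; i++ {
		pos := (h1 + i*h2) % f.m
		if f.bits[pos/64]&(uint64(1)<<(pos%64)) == 0 {
			return false
		}
	}
	return true
}
```

The false-positive rate depends on `m`, `k`, and the number of items added, so the filter would need to be sized for the number of series each sidecar holds.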

Describe the solution you'd like

Add another type of cache for this use case, perhaps called a proxy cache. It is similar to the caching bucket, but it caches endpoint responses instead. I also think the new galaxycache is very suitable for this use case.
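One detail such a proxy cache would need is a stable key per endpoint response. The sketch below shows one possible scheme under stated assumptions: `cacheKey` is a hypothetical helper, the key is a SHA-256 digest over the endpoint address, the gRPC method name, and the serialized request, and the request bytes are assumed to be serialized deterministically (which protobuf marshalling does not guarantee by default).

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
)

// cacheKey derives a hypothetical cache key for one endpoint response.
// endpoint is the sidecar address, method the gRPC method name, and
// request the serialized request message.
func cacheKey(endpoint, method string, request []byte) string {
	h := sha256.New()
	for _, part := range [][]byte{[]byte(endpoint), []byte(method), request} {
		// Length-prefix each part so that ("ab", "c") and ("a", "bc")
		// cannot produce the same hash input.
		var n [8]byte
		binary.BigEndian.PutUint64(n[:], uint64(len(part)))
		h.Write(n[:])
		h.Write(part)
	}
	return hex.EncodeToString(h.Sum(nil))
}
```

Keys built this way could be used with any backing store, whether a local map or a distributed cache such as galaxycache.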

Describe alternatives you've considered

Put some kind of gRPC proxy in front of the endpoints to cache or pass through requests as appropriate. I haven't investigated this yet, but something that suits my use case may already exist.

Jan 09 '22 02:01 yeya24

So, something like galaxycache but for gRPC calls? Did I understand you correctly?

Jan 09 '22 17:01 GiedriusS

So, something like galaxycache but for gRPC calls? Did I understand you correctly?

Yes

Jan 09 '22 18:01 yeya24

I agree, this would be great. Perhaps this could be an LFX project? In the meantime, I have been using a local version of this functionality: https://github.com/thanos-io/thanos/commit/310df0c5c982551d3412ce578cab52cf1e120ca6. It has already deduplicated thousands of Series() calls on my deployment. Perhaps we could merge this local version first and then work on the groupcache-esque one?
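One common way to deduplicate calls inside a single process is to coalesce identical in-flight requests, so that concurrent callers with the same key share one upstream call. The sketch below illustrates that general technique only; it is not taken from the linked commit, and `Group`, `Do`, and `Deduped` are hypothetical names.

```go
package main

import "sync"

// call tracks one in-flight request and its eventual result.
type call struct {
	wg  sync.WaitGroup
	val []byte
	err error
}

// Group coalesces concurrent calls that share the same key.
type Group struct {
	mu      sync.Mutex
	calls   map[string]*call
	deduped int // number of callers that reused an in-flight call
}

// Do runs fn for key unless an identical call is already in flight, in
// which case it waits for that call and returns its result instead.
// Results are not kept after the call finishes.
func (g *Group) Do(key string, fn func() ([]byte, error)) ([]byte, error) {
	g.mu.Lock()
	if g.calls == nil {
		g.calls = make(map[string]*call)
	}
	if c, ok := g.calls[key]; ok {
		g.deduped++
		g.mu.Unlock()
		c.wg.Wait() // wait for the leader to finish
		return c.val, c.err
	}
	c := &call{}
	c.wg.Add(1)
	g.calls[key] = c
	g.mu.Unlock()

	c.val, c.err = fn()
	c.wg.Done()

	g.mu.Lock()
	delete(g.calls, key)
	g.mu.Unlock()
	return c.val, c.err
}

// Deduped returns how many calls were served by an in-flight request.
func (g *Group) Deduped() int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.deduped
}
```

This only helps callers inside one querier process. Sharing results across querier replicas needs the distributed, groupcache-style layer discussed in this issue.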

Jan 13 '22 13:01 GiedriusS

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

Apr 16 '22 03:04 stale[bot]

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

Sep 21 '22 06:09 stale[bot]
