
store: add initial symbol tables support

Open GiedriusS opened this issue 3 years ago • 28 comments

Add initial support for symbol tables; symbol tables are sent via hints that are used to deduplicate strings. The main benefit of doing this is to reduce network traffic and reduce the number of allocations needed for all of the different strings.

Strings are referenced by their number. In total, there can be math.MaxUint64 strings. To avoid blocking in the querier, symbol tables / references are automatically adjusted depending on where each store sits in the stores slice. In other words, the whole math.MaxUint64 space is divided into an equal number of strings for each StoreAPI. If that limit is breached then raw strings are sent instead. This is taken care of by the LookupTable builder. We ensure that old versions are compatible with this one by passing the maximum number of allowed unique strings via StoreRequest. If that limit is zero then we disable building the lookup table.
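
To make the idea more concrete, here is a minimal sketch of the interning side, assuming a hypothetical `lookupTableBuilder` type (the names and shapes below are illustrative, not the actual code in this PR):

```go
package main

import "fmt"

// lookupTableBuilder is an illustrative stand-in for the builder described
// above: it interns strings into uint64 references up to a per-store limit.
// With limit == 0 the table is effectively disabled and every string is raw.
type lookupTableBuilder struct {
	limit uint64            // maximum number of unique strings for this StoreAPI
	refs  map[string]uint64 // string -> reference
	table []string          // reference -> string, shipped once via hints
}

func newLookupTableBuilder(limit uint64) *lookupTableBuilder {
	return &lookupTableBuilder{limit: limit, refs: make(map[string]uint64)}
}

// ref returns a reference for s and true, or 0 and false when the limit has
// been breached and the raw string has to be sent instead.
func (b *lookupTableBuilder) ref(s string) (uint64, bool) {
	if r, ok := b.refs[s]; ok {
		return r, true
	}
	if uint64(len(b.table)) >= b.limit {
		return 0, false
	}
	r := uint64(len(b.table))
	b.refs[s] = r
	b.table = append(b.table, s)
	return r, true
}

func main() {
	b := newLookupTableBuilder(2)
	for _, s := range []string{"__name__", "up", "__name__", "instance"} {
		if r, ok := b.ref(s); ok {
			fmt.Printf("%q -> ref %d\n", s, r)
		} else {
			fmt.Printf("%q sent as a raw string (limit reached)\n", s)
		}
	}
}
```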

This compression is beneficial in almost all cases. The worst case is when there are a lot of unique metrics with unique strings in each metric. However, I strongly believe that this will only happen 1% of the time due to the nature of monitoring: a set of metrics will always share some identical dimensions, i.e. the same labels with only one or two changing.

I have also attempted an alternative implementation where a Label could be a oneof of either compressed labels or regular labels. However, that implementation was cumbersome and not quite as elegant. This one is much cleaner.

For now, only streaming Sidecar + Thanos Store is implemented. We could add support for this in other components as a follow-up.

Signed-off-by: Giedrius Statkevičius [email protected]

GiedriusS avatar Nov 18 '22 14:11 GiedriusS

Nice! Is there any benchmark for this?

yeya24 avatar Nov 18 '22 15:11 yeya24

I made a quick first pass, this looks pretty cool! I could not understand if the layered querier topology is covered as well. Is the table (indexing space) somehow recursively divided?

fpetkovski avatar Nov 19 '22 18:11 fpetkovski

This looks really amazing! Would be interesting to see benchmarks too, especially with the lookup table bimap!

saswatamcode avatar Nov 21 '22 06:11 saswatamcode

:wave:

I made a quick first pass, this looks pretty cool! I could not understand if the layered querier topology is covered as well. Is the table (indexing space) somehow recursively divided?

Each upper layer divides the uint64 space into equal parts for each StoreAPI. If any StoreAPI exceeds that limit then it errors out. Given that string references are unique, I think the function that I'm using should always remap all IDs properly.
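
Roughly, the remapping an upper layer would do can be pictured like this (a hypothetical sketch with made-up names, not the function used in this PR):

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// remap is an illustrative version of the idea above: store i out of n owns
// the contiguous range [i*perStore, (i+1)*perStore) of the uint64 space, so a
// store-local reference is turned into a global one by adding the store's base
// offset. numStores is assumed to be > 0.
func remap(storeIndex, numStores, localRef uint64) (uint64, error) {
	perStore := uint64(math.MaxUint64) / numStores
	if localRef >= perStore {
		return 0, errors.New("store exceeded its share of the reference space")
	}
	return storeIndex*perStore + localRef, nil
}

func main() {
	global, err := remap(2, 4, 42) // StoreAPI #2 out of 4, local reference 42
	if err != nil {
		panic(err)
	}
	fmt.Println(global)
}
```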

I also thought about adding this behind a flag, but is it realistic to have thousands and thousands of StoreAPIs that each give millions of unique strings? :thinking: math.MaxUint64 is 18446744073709551615 :smile: I don't think it should even be reached. And even then, maybe we could bump to uint128.

What benchmarks would you like to see for this? If this code looks good to my teammates then I'll try this out in prod and see the results :eye:

GiedriusS avatar Nov 21 '22 11:11 GiedriusS

I agree that it's hard to have a benchmark since a lot of benefits will come from reducing network bandwidth. So anything we run on a single machine will not be representative.

fpetkovski avatar Nov 21 '22 12:11 fpetkovski

What benchmarks would you like to see for this? If this code looks good to my team mates then I'll try this out in prod and see the results

I was thinking benchmarking the compression/decompression might be useful so that if this is changed in the future, we can catch regressions! 🙂 But yes, it won't capture the network bandwidth benefits, so trying in prod sounds better for that.

saswatamcode avatar Nov 22 '22 08:11 saswatamcode

Here's how network traffic dropped after deploying this:

[image: network traffic graph]

Dropped by 3x. This is one of the servers that is only evaluating rules via Thanos Ruler so the effect is visible the most.

GiedriusS avatar Nov 22 '22 08:11 GiedriusS

Was there any reduction in query latency?

fpetkovski avatar Nov 22 '22 08:11 fpetkovski

Was there any reduction in query latency?

Doesn't look like it, maybe just a little bit :( But with much smaller network usage, it's possible to fit more recording/alerting rules into it.

GiedriusS avatar Nov 22 '22 09:11 GiedriusS

RAM usage also dropped by around 20% with no noticeable increase in CPU usage :thinking:

[images: RAM and CPU usage graphs]

GiedriusS avatar Nov 22 '22 11:11 GiedriusS

cc @bboreham - want to check it? You might have some thoughts. (:

bwplotka avatar Nov 22 '22 13:11 bwplotka

Thanks for this; I have considered doing something like this in the remote-write protocol. A question that occurs to me is "when is a symbol table discarded from memory?", but perhaps I didn't understand.

Re some conversations we had at PromCon, you may like https://github.com/grafana/mimir-prometheus/pull/353.

bboreham avatar Nov 22 '22 15:11 bboreham

FWIW, a somewhat related idea to others in this conversation: reusing strings in the receiver (https://github.com/thanos-io/thanos/pull/5899). I did not manage to finalize it yet though.

matej-g avatar Nov 22 '22 15:11 matej-g

Thanks @bboreham sending me this way.

@GiedriusS I have a similar implementation of a symbols table, based originally on the interner we already use in remote write. Let me know if you'd like to contribute yours upstream so that it could be reused in remote write.

cstyan avatar Nov 24 '22 21:11 cstyan

Thanks for your review @matej-g, I have made adjustments according to your comments. Also, I have fixed one bug I noticed after deploying this in prod.

@bboreham the symbol table is built and discarded after a Series() call or, in other words, its lifetime/scope is the Select() call from the PromQL engine. I hope this answers your question. Perhaps we could improve this later on.
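
To illustrate that lifetime with a toy sketch (hypothetical code, not what this PR does): the table only has to survive the decoding of one response and becomes garbage as soon as the call returns.

```go
package main

import "fmt"

// decodeSeries is a toy illustration of the scope described above: the symbol
// table is only needed while one response is being decoded back into strings.
func decodeSeries(table []string, refs [][]uint64) [][]string {
	out := make([][]string, 0, len(refs))
	for _, series := range refs {
		labels := make([]string, 0, len(series))
		for _, r := range series {
			labels = append(labels, table[r])
		}
		out = append(out, labels)
	}
	// Once the caller is done with the decoded labels, nothing references the
	// table anymore and it can be collected.
	return out
}

func main() {
	table := []string{"__name__", "up", "job", "node"}
	fmt.Println(decodeSeries(table, [][]uint64{{0, 1, 2, 3}}))
}
```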

@cstyan by "upstream" do you mean Prometheus? Could you please post some links? Regardless, I'd prefer to merge this as-is and then we can work on further items.

GiedriusS avatar Nov 29 '22 10:11 GiedriusS

Hello 👋 Looks like there was no activity on this amazing PR for the last 30 days. Do you mind updating us on the status? Is there anything we can help with? If you plan to still work on it, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next week, this issue will be closed (we can always reopen a PR if you get back to this!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] avatar Jan 07 '23 21:01 stale[bot]

Been running this for months, haven't spotted any problems. PTAL.

GiedriusS avatar Jan 09 '23 09:01 GiedriusS

This looks good overall, I'll take another look next week to make sure we don't miss something.

fpetkovski avatar Jan 13 '23 16:01 fpetkovski

Updated PR - merged main and made changes according to the comments. Also, while running this in prod I've noticed that decompression takes a long time with queries that retrieve millions of series, hence I've added concurrent decompression support that is tunable via a new flag. Also, this functionality is now disabled unless that flag is set to a value greater than zero.
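
The concurrent decompression can be pictured roughly like this (an illustrative sketch with invented names; the real flag and code paths differ):

```go
package main

import (
	"fmt"
	"sync"
)

// decompressConcurrently fans series out to `concurrency` workers that resolve
// symbol references back into strings. This is only a sketch of the idea; in
// the PR the whole feature is skipped when the flag is not set above zero,
// whereas here we simply fall back to a single worker.
func decompressConcurrently(table []string, refs [][]uint64, concurrency int) [][]string {
	if concurrency <= 0 {
		concurrency = 1
	}
	out := make([][]string, len(refs))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < concurrency; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				labels := make([]string, len(refs[i]))
				for j, r := range refs[i] {
					labels[j] = table[r]
				}
				out[i] = labels
			}
		}()
	}
	for i := range refs {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return out
}

func main() {
	table := []string{"job", "node_exporter"}
	fmt.Println(decompressConcurrently(table, [][]uint64{{0, 1}}, 4))
}
```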

I can't imagine running Thanos in prod without this anymore because Queriers would easily get overwhelmed - we have thousands of recording rules and labels completely dominate the bandwidth.

GiedriusS avatar Feb 01 '23 13:02 GiedriusS

How much does this feature help, any dashboards/metrics to share? @GiedriusS 👀

Updated: NVM I see the improvements!

yeya24 avatar Feb 02 '23 18:02 yeya24

@GiedriusS anything still blocking this? Or you just need to resolve conflicts?

matej-g avatar Mar 03 '23 08:03 matej-g

This will need to be reworked a bit after the latest changes because ordering has become important. It means that we need to somehow merge the symbol references from multiple StoreAPIs in a way that preserves order. I don't know how to do that at the moment. The references will need to be encoded in a way that preserves order.

GiedriusS avatar Mar 03 '23 09:03 GiedriusS

Hello 👋 Looks like there was no activity on this amazing PR for the last 30 days. Do you mind updating us on the status? Is there anything we can help with? If you plan to still work on it, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next week, this issue will be closed (we can always reopen a PR if you get back to this!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] avatar May 21 '23 19:05 stale[bot]

Closing for now as promised, let us know if you need this to be reopened! 🤗

stale[bot] avatar Jun 18 '23 09:06 stale[bot]

Do we still want to continue this work?

yeya24 avatar Jun 18 '23 17:06 yeya24

+1 for continuing the work

MitchellJThomas avatar Sep 27 '23 17:09 MitchellJThomas

I have hesitations about adopting something that makes streaming large result sets even harder, personally. This is often desirable for intermediate results in distributed Query etc.

I'd also be worried about the impact this has on memory use. How and when is the symbol table pruned, and how is its memory capped + made visible for monitoring?

Nothing wrong with it being optional.

Have you considered a streaming-friendly dictionary encoder with an LRU to cap memory use? It would only help with repeated labels within one response, not between responses, but should have significant benefits.
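
For illustration only, a rough sketch of such an LRU-capped dictionary encoder (all names hypothetical; a real protocol would also need the decoder to mirror the eviction deterministically):

```go
package main

import (
	"container/list"
	"fmt"
)

// lruDict is an illustrative streaming dictionary: strings are interned into
// references, and an LRU eviction policy caps how many entries the dictionary
// holds, bounding memory for a single response.
type lruDict struct {
	cap   int
	refs  map[string]*list.Element
	order *list.List // most recently used entries at the front
	next  uint64
}

type entry struct {
	s   string
	ref uint64
}

func newLRUDict(capacity int) *lruDict {
	return &lruDict{cap: capacity, refs: make(map[string]*list.Element), order: list.New()}
}

// encode returns (reference, true) for strings already in the dictionary, or
// (0, false) when the string must be sent literally; in that case a new entry
// is added, possibly evicting the least recently used one.
func (d *lruDict) encode(s string) (uint64, bool) {
	if el, ok := d.refs[s]; ok {
		d.order.MoveToFront(el)
		return el.Value.(entry).ref, true
	}
	if d.order.Len() >= d.cap {
		oldest := d.order.Back()
		d.order.Remove(oldest)
		delete(d.refs, oldest.Value.(entry).s)
	}
	d.refs[s] = d.order.PushFront(entry{s: s, ref: d.next})
	d.next++
	return 0, false
}

func main() {
	d := newLRUDict(2)
	for _, s := range []string{"job", "instance", "job", "pod", "instance"} {
		if ref, hit := d.encode(s); hit {
			fmt.Printf("%q -> ref %d\n", s, ref)
		} else {
			fmt.Printf("%q sent literally, added to dictionary\n", s)
		}
	}
}
```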

It might be then possible for client and server to maintain a pre-seeded decoder state to handle the most common labels shared between queries too.

The recent discussions in Prometheus about moving external labels to a separate message, instead of merging them into the result label set, would help a bit here too, if adopted.

(IMO the most interesting thing about a symbol table would be if the series could be kept in partially symbolised form, not expanded during receipt, then processed through the executor in that form and only expanded on demand. The memory savings when Thanos accumulates huge matrices in memory could be enormous if many common labels on the series were factored out into a shared symbol table. Pruning becomes a challenge though, and the costs of the on-demand decoding during sorting etc. might prove too high.)

ringerc avatar Oct 05 '25 20:10 ringerc

(IMO the most interesting thing about a symbol table would be if the series could be kept in partially symbolised form, not expanded during receipt, then processed through the executor in that form and only expanded on demand. The memory savings when Thanos accumulates huge matrices in memory could be enormous if many common labels on the series were factored out into a shared symbol table. Pruning becomes a challenge though, and the costs of the on-demand decoding during sorting etc. might prove too high.)

I believe for that we would need to adjust and expand labels.Labels, and historically there has been pushback against that. We could expose https://github.com/prometheus/prometheus/blob/f4b8840f51bee8afa258e0d4e27fc7fc03358760/model/labels/labels_dedupelabels.go#L31, but the reply was that "this is not a serialization format". I cannot find the exact PR with this comment right now.

GiedriusS avatar Oct 06 '25 07:10 GiedriusS