
store: Thanos consumes 40G at startup

Open fagossa opened this issue 3 years ago • 3 comments

Hello,

I'm having some uncontrolled memory consumption with thanos store.

What happened:

At startup, memory peaks at ~40GiB and later decreases over time to ~26GiB. However, depending on the load, it climbs back to the peak value and even beyond (which means an OOM kill).

[Screenshot, 2022-05-16 17:16:56]

This happens even though I'm using memcached precisely to avoid this situation.

I currently have more than 145k blocks in my S3 storage and more than 80 Prometheus instances (+sidecars).

What I expected:

Since a cache is in use, I expected memory usage to stay low.

At least it seems the cache is being used:

[Screenshot, 2022-05-16 17:59:37]

current configuration

thanos store \
--data-dir="/var/thanos/store" \
--objstore.config-file="/etc/thanos/bucket.yml" \
--http-address="0.0.0.0:10902" \
--grpc-address="0.0.0.0:10901" \
--log.format="json" \
--log.level="info" \
--store.index-header-posting-offsets-in-mem-sampling=50 \
--index-cache.config-file="/etc/thanos/thanos-store-cache-config.yml"

this is my cache configuration

type: MEMCACHED
config:
  addresses: ["memcached:11211"]
  timeout: 3s
  max_idle_connections: 200
  max_async_concurrency: 20
  max_async_buffer_size: 10000000
  max_item_size: 300MiB
  max_get_multi_concurrency: 5
  max_get_multi_batch_size: 20
  dns_provider_update_interval: 10s

Environment:

  • thanos 0.25.1
  • prometheus 2.28.1

What have I tested so far?

1 - sharding by date using flags

I created several stores, each covering a 3-month period, using this pattern:

thanos store \
...
--max-time=-12w \
--min-time=-24w \

but all stores ended up consuming the same amount of memory.
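For readers who want to reproduce the time-based sharding, here is a minimal sketch of how contiguous, non-overlapping windows could be derived from the `--min-time` / `--max-time` pattern above. The helper name `shard_flags`, the 12-week width, and the shard count of 3 are assumptions chosen for illustration; they are not taken from the reporter's actual setup.

```shell
#!/bin/sh
# Hypothetical helper (not part of Thanos): print the time-range flags for
# shard number $1 (0 = newest), where each shard spans $2 weeks and there
# are $3 shards in total.
shard_flags() {
  idx=$1
  width=$2
  total=$3
  flags=""
  # Every shard except the newest gets an upper time bound.
  if [ "$idx" -gt 0 ]; then
    flags="--max-time=-$((idx * width))w"
  fi
  # Every shard except the oldest gets a lower time bound.
  if [ "$idx" -lt $((total - 1)) ]; then
    flags="${flags:+$flags }--min-time=-$(((idx + 1) * width))w"
  fi
  echo "$flags"
}

# Print the flag set each of the three store instances would be started with.
for i in 0 1 2; do
  echo "shard $i: $(shard_flags "$i" 12 3)"
done
```

The middle shard gets the same two flags as in the pattern above. The newest and oldest shards are left open-ended on one side so that no block falls outside every window.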

2 - changing the compact/level

I basically followed this issue https://github.com/thanos-io/thanos/issues/325

Anything else

I don't know if this information is useful, but the number of samples varies greatly from one Prometheus instance to another:

[Screenshot, 2022-05-16 17:53:20]

fagossa, May 16 '22 16:05

Do you have persistent storage on your Thanos Store pods (I assume k8s here)? The RAM usage probably comes from building binary index headers. Maybe there is some opportunity here to use sync.Pool to have a constant RAM usage. With persistent storage, you wouldn't have to rebuild them each time. Could you upload a profile of the memory usage of Thanos Store just after the start of the process?
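One possible way to capture the requested profile is sketched below. It assumes the store serves Go's standard pprof handlers under `/debug/pprof/` on the HTTP address from the configuration above (port 10902); the function names and the output file name are hypothetical.

```shell
#!/bin/sh
# Assumption: the Thanos Store HTTP port exposes Go's pprof endpoints.
# Build the heap-profile URL for a host:port pair.
pprof_heap_url() {
  echo "http://$1/debug/pprof/heap"
}

# Download the heap profile from host:port ($1) into a local file ($2).
fetch_heap_profile() {
  curl -sS -o "$2" "$(pprof_heap_url "$1")"
}

# Example usage, run shortly after the store process has started:
#   fetch_heap_profile 127.0.0.1:10902 store-heap.pprof
#   go tool pprof -top store-heap.pprof
```

Taking the profile close to the startup peak matters here, since the question is what is allocated while the store is starting up.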

GiedriusS, May 16 '22 20:05

Hello, thanks a lot for your answer!

We are actually using docker swarm with persistent volumes. I do not think there is a problem at this level because we have hundreds of services (not related to prometheus+thanos) running without any issue.

As for memory, here is the memory usage of the store serving the bucket with 145k blocks, after a reboot:

[Screenshot, 2022-05-17 10:29:20]

and here is the similar behavior of one of the stores sharded on three months (using min-time and max-time), after a reboot:

[Screenshot, 2022-05-17 10:27:42]

Note:

  • This might be unrelated, but the Thanos compactor is down because there are some overlapping blocks.

fagossa, May 17 '22 08:05

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot], Jul 31 '22 04:07