
sidecar, stores, and query container memory spikes


**Thanos version**: v0.33.0 (thanosio/thanos:v0.33.0)

**Object storage**: EMC OneFS S3 buckets

Prometheus Setup:

  • 2 hero hosts scraping metrics from 7 or so sidekick servers
  • a sidecar container on each hero host that reads blocks from the Prometheus data directory and uploads them to S3 storage (a rough sketch of the invocation is below)
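
For reference, a hedged sketch of that sidecar invocation (paths, ports, and the bucket config file name are illustrative assumptions, not the exact flags in use here):

    # sketch: read the local Prometheus TSDB and upload completed blocks to the
    # S3 bucket described in bucket.yml (hypothetical file name)
    thanos sidecar \
      --tsdb.path=/prometheus/data \
      --prometheus.url=http://localhost:9090 \
      --objstore.config-file=/etc/thanos/bucket.yml \
      --grpc-address=0.0.0.0:10901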

Thanos Setup:

  • 2 hosts, each running query, query frontend, compactor, redis, and two store instances
  • the query container reads from the 2 store instances (which serve the S3 buckets written to by the 2 hero hosts) and from the sidecar instances on the 2 hero hosts (for the locally stored Prometheus TSDB data); see the sketch below
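
As a rough sketch of that wiring (addresses and ports below are placeholders, not the real hosts), the querier on each Thanos host fans out over gRPC to the two store gateways and the two sidecars:

    # illustrative querier endpoints only; actual host names will differ
    thanos query \
      --http-address=0.0.0.0:10902 \
      --endpoint=store-1:10901 \
      --endpoint=store-2:10901 \
      --endpoint=hero-host-1:10901 \
      --endpoint=hero-host-2:10901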

There is also a round-robin load-balanced nginx configuration proxying queries across the two query frontend instances.
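
For context, the proxy layer is roughly the following (a hedged sketch with placeholder names and ports; the real config differs):

    # nginx defaults to round-robin across the servers in an upstream block
    upstream thanos_query_frontend {
        server thanos-host-1:10902;
        server thanos-host-2:10902;
    }

    server {
        listen 80;
        location / {
            proxy_pass http://thanos_query_frontend;
        }
    }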

Just yesterday I put that URL in play as the datasource in our Grafana instance to have the two Thanos hosts serve up all of our metrics. Since then, I've been seeing occasional spikes in RAM usage from the sidecar (on the Prometheus hosts) as well as from the query and store containers (on the Thanos hosts).

I fully expected there to be some load occurring when putting the querier to work with Grafana, but didn't expect to see memory spikes quite that high from the query component.

I suspect it's a specific query from Grafana rather than the number of users hitting dashboards. I also have no memory limits set yet on the Thanos containers.

Sorry, that's all I have to go on at this time, though maybe I can go back through the nginx logs to see whether a specific query was issued when these spikes occur (a rough approach is sketched below).
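
One rough way to do that, assuming the access log records the request URI in the default combined format (the log path and timestamp window below are made up for illustration):

    # list the most common query/query_range requests around a spike window
    grep '26/Jan/2024:19:' /var/log/nginx/access.log \
      | grep -E 'api/v1/(query|query_range)' \
      | awk '{print $7}' | sort | uniq -c | sort -rn | head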

Any thoughts on the matter would be appreciated.

[screenshot of container memory usage attached]

grimz-ly (Jan 26, 2024)

I saw the same thing yesterday. The thanos-sidecar started eating all the memory available on the node until it was OOM-killed and crashed.

[screenshot of sidecar memory usage attached]

It's the very first time we've noticed this behavior. The logs didn't show anything out of the ordinary.

We're also running 0.33.

maxdec (Feb 9, 2024)

I'm also seeing the same issue. Does upgrading to 0.34.x fix this? We upgraded from 0.31.0 to 0.33.0 and started noticing it a couple of days later.

b-lancaster (Feb 26, 2024)

Still seeing this with 0.34.1. I'm convinced it's a bad query or two being issued from a Grafana panel, but I'm not sure how to track it down exactly.

grimz-ly (Apr 22, 2024)

> Still seeing this with 0.34.1. I'm convinced it's a bad query or two being issued from a Grafana panel, but I'm not sure how to track it down exactly.

Can you try configuring:

      --query.active-query-path=""
                                 Directory to log currently active queries in
                                 the queries.active file.

On restart of the querier, it should hopefully log which queries were active when it crashed.
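
For example (a sketch with an assumed writable directory; adjust to the actual deployment):

    # track in-flight queries in /var/thanos/query/queries.active so that,
    # after an OOM kill and restart, the querier can report what was running
    thanos query \
      --http-address=0.0.0.0:10902 \
      --query.active-query-path=/var/thanos/query \
      --endpoint=store-1:10901

The queries reported after a crash can then be matched back to the Grafana panels that issued them.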

MichaHoffmann (Apr 22, 2024)

Is there any fix or suggestion to avoid this issue?

nish5uec (Jul 29, 2024)

Usually it is just because of the query you ran; when you query a large amount of data, memory spikes are expected. To avoid OOM kills, try setting the GOMEMLIMIT environment variable to protect the Thanos components (a sketch is below).
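
A hedged sketch of what that could look like for a containerized querier (image tag, limit values, and endpoint are illustrative):

    # give the Go runtime a soft memory limit below the container's hard limit so
    # the GC works harder before the kernel OOM killer steps in
    docker run -d --name thanos-query \
      --memory=8g \
      -e GOMEMLIMIT=6GiB \
      thanosio/thanos:v0.34.1 \
      query \
      --http-address=0.0.0.0:10902 \
      --endpoint=store-1:10901

The same variable works for the sidecar and store containers; keeping GOMEMLIMIT some margin below the container limit leaves headroom for stacks and other non-heap memory.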

yeya24 (Aug 4, 2024)