[BUG] Cortex v1.18.0 Upgrade Causing OOMKills and CPU Spikes in Store-Gateway
Describe the bug
Following the upgrade of Cortex from v1.17.1 to v1.18.0, the Store Gateway pods are frequently being OOMKilled. The kills appear to happen at random, roughly every 5 minutes, and have continued ever since the upgrade. Before the upgrade, memory usage consistently hovered around 4GB with CPU usage under 1 core; after the upgrade, both CPU and memory usage have spiked to more than 10 times their typical levels. Even after increasing the Store Gateway memory limit to 30GB, the issue persists (see graphs below).
We initially suspected the issue might be related to the sharding ring configurations, so we attempted to disable the following flags:
- store-gateway.sharding-ring.zone-awareness-enabled=False
- store-gateway.sharding-ring.zone-stable-shuffle-sharding=False
However, this did not resolve the problem.
CPU Graph: The far left shows usage before the upgrade, the middle represents usage during the upgrade, and the far right illustrates the rollback, where CPU usage returns to normal levels.
Memory Graph: The far left shows memory usage before the upgrade, the middle represents usage during the upgrade, and the far right reflects the rollback, where memory usage returns to normal levels.
To Reproduce
Steps to reproduce the behavior:
- Upgrade from Cortex v1.17.1 to v1.18.0 using the Cortex Helm Chart with the values listed in the Additional Context section (an example upgrade command is sketched below).
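For reference, the upgrade step itself is a plain Helm upgrade along these lines (release name, namespace, and values file name are placeholders for our actual setup):
helm repo add cortex-helm https://cortexproject.github.io/cortex-helm-chart
helm repo update
# Upgrade the existing release, pinning the chart version and passing the values
# shown under Additional Context (image.tag overrides the Cortex version to v1.18.0)
helm upgrade cortex cortex-helm/cortex --namespace cortex --version 2.4.0 -f values.yaml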
Expected behavior
The store-gateway pods should not be OOMKilled.
Environment:
- Infrastructure: AKS (Kubernetes)
- Deployment tool: Cortex Helm Chart v2.3.0 or v2.4.0
Additional Context
Helm Chart Values Passed
useExternalConfig: true
image:
  repository: redact
  tag: v1.18.0
externalConfigVersion: x
ingress:
  enabled: true
  ingressClass:
    enabled: true
    name: nginx
  hosts:
    - host: cortex.redact
      paths:
        - /
  tls:
    - hosts:
        - cortex.redact
serviceAccount:
  create: true
  automountServiceAccountToken: true
store_gateway:
  replicas: 6
  persistentVolume:
    storageClass: premium
    size: 64Gi
  resources:
    limits:
      memory: 24Gi
    requests:
      memory: 18Gi
  extraArgs:
    blocks-storage.bucket-store.index-cache.memcached.max-async-buffer-size: "10000000"
    blocks-storage.bucket-store.index-cache.memcached.max-get-multi-concurrency: "100"
    blocks-storage.bucket-store.index-cache.memcached.max-get-multi-batch-size: "100"
    blocks-storage.bucket-store.bucket-index.enabled: true
    blocks-storage.bucket-store.index-header-lazy-loading-enabled: true
    store-gateway.sharding-ring.zone-stable-shuffle-sharding: False
    store-gateway.sharding-ring.zone-awareness-enabled: False
  serviceMonitor:
    enabled: true
    additionalLabels:
      release: kube-prometheus-stack
    relabelings:
      - sourceLabels: [__meta_kubernetes_pod_name]
        targetLabel: instance
compactor:
  persistentVolume:
    size: 256Gi
    storageClass: premium
  resources:
    limits:
      cpu: 4
      memory: 10Gi
    requests:
      cpu: 1.5
      memory: 5Gi
  serviceMonitor:
    enabled: true
    additionalLabels:
      release: kube-prometheus-stack
    relabelings:
      - sourceLabels: [__meta_kubernetes_pod_name]
        targetLabel: instance
  extraArgs:
    blocks-storage.bucket-store.bucket-index.enabled: true
nginx:
  replicas: 3
  image:
    repository: redact
    tag: 1.27.2-alpine-slim
  serviceMonitor:
    enabled: true
    additionalLabels:
      release: kube-prometheus-stack
    relabelings:
      - sourceLabels: [__meta_kubernetes_pod_name]
        targetLabel: instance
  resources:
    limits:
      cpu: 500m
      memory: 500Mi
    requests:
      cpu: 500m
      memory: 500Mi
  config:
    verboseLogging: false
query_frontend:
  replicas: 3
  resources:
    limits:
      cpu: 1
      memory: 5Gi
    requests:
      cpu: 200m
      memory: 4Gi
  serviceMonitor:
    enabled: true
    additionalLabels:
      release: kube-prometheus-stack
    relabelings:
      - sourceLabels: [__meta_kubernetes_pod_name]
        targetLabel: instance
  extraArgs:
    querier.query-ingesters-within: 8h
querier:
  replicas: 3
  resources:
    limits:
      cpu: 8
      memory: 26Gi
    requests:
      cpu: 1
      memory: 20Gi
  extraArgs:
    querier.query-ingesters-within: 8h
    querier.max-fetched-data-bytes-per-query: "2147483648"
    querier.max-fetched-chunks-per-query: "1000000"
    querier.max-fetched-series-per-query: "200000"
    querier.max-samples: "50000000"
    blocks-storage.bucket-store.bucket-index.enabled: true
  serviceMonitor:
    enabled: true
    additionalLabels:
      release: kube-prometheus-stack
    relabelings:
      - sourceLabels: [__meta_kubernetes_pod_name]
        targetLabel: instance
ingester:
  statefulSet:
    enabled: true
  replicas: 18
  persistentVolume:
    enabled: true
    size: 64Gi
    storageClass: premium
  resources:
    limits:
      cpu: 8
      memory: 45Gi
    requests:
      cpu: 8
      memory: 40Gi
  extraArgs:
    ingester.max-metadata-per-user: "50000"
    ingester.max-series-per-metric: "200000"
    ingester.instance-limits.max-series: "0"
    ingester.ignore-series-limit-for-metric-names: "redact"
  serviceMonitor:
    enabled: true
    additionalLabels:
      release: kube-prometheus-stack
    relabelings:
      - sourceLabels: [__meta_kubernetes_pod_name]
        targetLabel: instance
ruler:
  validation:
    enabled: false
  replicas: 3
  resources:
    limits:
      cpu: 2
      memory: 6Gi
    requests:
      cpu: 500m
      memory: 3Gi
  sidecar:
    image:
      repository: redact
      tag: 1.28.0
    resources:
      limits:
        cpu: 1
        memory: 200Mi
      requests:
        cpu: 50m
        memory: 100Mi
    enabled: true
    searchNamespace: cortex-rules
    folder: /tmp/rules
  serviceMonitor:
    enabled: true
    additionalLabels:
      release: kube-prometheus-stack
    relabelings:
      - sourceLabels: [__meta_kubernetes_pod_name]
        targetLabel: instance
  extraArgs:
    blocks-storage.bucket-store.bucket-index.enabled: true
    querier.max-fetched-chunks-per-query: "2000000"
alertmanager:
  enabled: true
  replicas: 3
  podAnnotations:
    configmap.reloader.stakater.com/reload: "redact"
  statefulSet:
    enabled: true
  persistentVolume:
    size: 8Gi
    storageClass: premium
  sidecar:
    image:
      repository: redact
      tag: 1.28.0
    containerSecurityContext:
      enabled: true
      runAsUser: 0
    resources:
      limits:
        cpu: 100m
        memory: 200Mi
      requests:
        cpu: 50m
        memory: 100Mi
    enabled: true
    searchNamespace: cortex-alertmanager
  serviceMonitor:
    enabled: true
    additionalLabels:
      release: kube-prometheus-stack
    relabelings:
      - sourceLabels: [__meta_kubernetes_pod_name]
        targetLabel: instance
distributor:
  resources:
    limits:
      cpu: 4
      memory: 10Gi
    requests:
      cpu: 2
      memory: 10Gi
  extraArgs:
    distributor.ingestion-rate-limit: "120000"
    validation.max-label-names-per-series: 40
    distributor.ha-tracker.enable-for-all-users: true
    distributor.ha-tracker.enable: true
    distributor.ha-tracker.failover-timeout: 30s
    distributor.ha-tracker.cluster: "prometheus"
    distributor.ha-tracker.replica: "prometheus_replica"
    distributor.ha-tracker.consul.hostname: consul.cortex:8500
    distributor.instance-limits.max-ingestion-rate: "120000"
  serviceMonitor:
    enabled: true
    additionalLabels:
      release: kube-prometheus-stack
    relabelings:
      - sourceLabels: [__meta_kubernetes_pod_name]
        targetLabel: instance
  autoscaling:
    minReplicas: 15
    maxReplicas: 30
memcached-frontend:
  enabled: true
  image:
    registry: redact
    repository: redact/memcached-bitnami
    tag: redact
  commonLabels:
    release: kube-prometheus-stack
  podManagementPolicy: OrderedReady
  metrics:
    enabled: true
    image:
      registry: redact
      repository: redact/memcached-exporter-bitnami
      tag: redact
    serviceMonitor:
      enabled: true
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: instance
  resources:
    requests:
      memory: 1Gi
      cpu: 1
    limits:
      memory: 1.5Gi
      cpu: 1
  args:
    - /run.sh
    - -I 32m
  serviceAccount:
    create: true
memcached-blocks-index:
  enabled: true
  image:
    registry: redact
    repository: redact/memcached-bitnami
    tag: redact
  commonLabels:
    release: kube-prometheus-stack
  podManagementPolicy: OrderedReady
  metrics:
    enabled: true
    image:
      registry: redact
      repository: redact/memcached-exporter-bitnami
      tag: redact
    serviceMonitor:
      enabled: true
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: instance
  resources:
    requests:
      memory: 1Gi
      cpu: 1
    limits:
      memory: 1.5Gi
      cpu: 1.5
  args:
    - /run.sh
    - -I 32m
  serviceAccount:
    create: true
memcached-blocks:
  enabled: true
  image:
    registry: redact
    repository: redact/memcached-bitnami
    tag: redact
  commonLabels:
    release: kube-prometheus-stack
  podManagementPolicy: OrderedReady
  metrics:
    enabled: true
    image:
      registry: redact
      repository: redact/memcached-exporter-bitnami
      tag: redact
    serviceMonitor:
      enabled: true
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: instance
  resources:
    requests:
      memory: 2Gi
      cpu: 1
    limits:
      memory: 3Gi
      cpu: 1
  args:
    - /run.sh
    - -I 32m
  serviceAccount:
    create: true
memcached-blocks-metadata:
  enabled: true
  image:
    registry: redact
    repository: redact/memcached-bitnami
    tag: redact
  commonLabels:
    release: kube-prometheus-stack
  podManagementPolicy: OrderedReady
  metrics:
    enabled: true
    image:
      registry: redact
      repository: redact/memcached-exporter-bitnami
      tag: redact
    serviceMonitor:
      enabled: true
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: instance
  resources:
    requests:
      memory: 1Gi
      cpu: 1
    limits:
      memory: 1.5Gi
      cpu: 1
  args:
    - /run.sh
    - -I 32m
  serviceAccount:
    create: true
runtimeconfigmap:
  create: true
  annotations: {}
  runtime_config: {}
Quick PPROF of Store GW
curl -s http://localhost:8080/debug/pprof/heap > heap.out
go tool pprof heap.out
top
Showing nodes accounting for 622.47MB, 95.80% of 649.78MB total
Dropped 183 nodes (cum <= 3.25MB)
Showing top 10 nodes out of 49
flat flat% sum% cum cum%
365.95MB 56.32% 56.32% 365.95MB 56.32% github.com/thanos-io/thanos/pkg/block/indexheader.(*BinaryReader).init.func3
127.94MB 19.69% 76.01% 528.48MB 81.33% github.com/thanos-io/thanos/pkg/block/indexheader.(*BinaryReader).init
76.30MB 11.74% 87.75% 76.30MB 11.74% github.com/thanos-io/thanos/pkg/cacheutil.NewAsyncOperationProcessor
34.59MB 5.32% 93.07% 34.59MB 5.32% github.com/prometheus/prometheus/tsdb/index.NewSymbols
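For a fuller view than the flat top output, the same heap profile can also be inspected cumulatively or in pprof's web UI (port and file names here are just examples):
# Sort by cumulative allocations instead of flat, inside the interactive prompt
go tool pprof heap.out
(pprof) top -cum
# Or open the web UI with flame graph and source views
go tool pprof -http=:8081 heap.out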
Hi @dpericaxon,
Thanks for filing the issue.
I was looking at the pprof attached in the issue and noticed that LabelValues looked interesting.
github.com/thanos-io/thanos/pkg/block/indexheader.(*LazyBinaryReader).LabelValues
Something that changed in between 1.17.1 and 1.18.0 is this
- [CHANGE] Ingester: Remove -querier.query-store-for-labels-enabled flag. Querying long-term store for labels is always enabled. #5984
I don't see this flag being set in the querier section of your values file, so it would not have been enabled before the upgrade:
querier.query-ingesters-within: 8h
querier.max-fetched-data-bytes-per-query: "2147483648"
querier.max-fetched-chunks-per-query: "1000000"
querier.max-fetched-series-per-query: "200000"
querier.max-samples: "50000000"
blocks-storage.bucket-store.bucket-index.enabled: true
I have a feeling that since it is always enabled, the label values are being returned for the entire time range instead of just the instant that the query was run.
Could you try setting querier.query-store-for-labels-enabled: true in 1.17.1 in your setup and see if the issue happens?
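In the Helm values, that would look roughly like this (a sketch; keep your existing querier extraArgs and just add the flag):
querier:
  extraArgs:
    # Existing flags stay as they are; on 1.17.1 this enables querying the
    # long-term store for label names/values, matching the 1.18.0 behavior.
    querier.query-store-for-labels-enabled: true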
It could indeed be because of that flag... good catch @CharlieTLe
Maybe we should default the series/label names APIs to query only the last 24 hours if the time range is not specified?
I think we should be able to set a limit for how many label values can be queried so that even if a long time range is specified, it doesn't cause the store-gateway to use too much memory.
There is an effort to limit this, but it may not be straightforward, as the limit can only be applied after querying the index (and for those particular APIs, that is all the work).
Should we add the flag to restore the previous behavior until a limit can be set on the maximum number of label values that could be fetched? Or perhaps setting an execution time limit on the fetching so that it can be cancelled if it's taking longer than a specified duration?
I think this specific API call is mostly used by query builders to make auto-complete possible?
I don't think the heap usage increase was caused by label values requests. If you look at the heap profile, the memory was used by the binary index-header part, which is expected, as the Store Gateway caches blocks' symbols and some postings. Also, the heap profile provided may not capture what actually took the memory, since it only accounts for about 600MB.
I recommend taking another heap dump from a Store Gateway where you observe high memory usage.
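For example, something along these lines, with the pod name, namespace, and port adjusted to your deployment (the port matches the one you used above):
# Port-forward to a store-gateway that is showing high memory usage
kubectl -n cortex port-forward pod/cortex-store-gateway-0 8080:8080 &
# Grab the heap profile while memory is elevated
curl -s http://localhost:8080/debug/pprof/heap > heap-high-mem.out
go tool pprof heap-high-mem.out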
Thank you @CharlieTLe and @yeya24 for your suggestions.
- We first tried setting querier.query-store-for-labels-enabled: true in version 1.17.1. After making this change, we observed that the Store Gateway pods started frequently encountering OOMKills, with both CPU and memory usage spiking far beyond their usual levels.
- Since we were able to reproduce the issue with querier.query-store-for-labels-enabled: true, we set it back to false and then upgraded to v1.18.0. Unfortunately, even with querier.query-store-for-labels-enabled: false, the Store Gateway pods continued encountering OOMKills, and CPU and memory usage spiked again.
CPU and memory spike after setting the flag to false and upgrading to v1.18.0:
Here’s a quick PPROF of the Store Gateway during one of these OOM incidents:
(pprof) top
Showing nodes accounting for 975.10MB, 95.68% of 1019.09MB total
Dropped 206 nodes (cum <= 5.10MB)
Showing top 10 nodes out of 66
flat flat% sum% cum cum%
464.82MB 45.61% 45.61% 464.82MB 45.61% github.com/thanos-io/thanos/pkg/block/indexheader.(*BinaryReader).init.func3
178.29MB 17.50% 63.11% 683.35MB 67.05% github.com/thanos-io/thanos/pkg/block/indexheader.(*BinaryReader).init
129.10MB 12.67% 75.78% 129.10MB 12.67% github.com/thanos-io/thanos/pkg/pool.NewBucketedBytes.func1
76.84MB 7.54% 83.32% 76.84MB 7.54% github.com/thanos-io/thanos/pkg/cacheutil.NewAsyncOperationProcessor
64.77MB 6.36% 89.67% 65.77MB 6.45% github.com/bradfitz/gomemcache/memcache.parseGetResponse
40.23MB 3.95% 93.62% 40.23MB 3.95% github.com/prometheus/prometheus/tsdb/index.NewSymbols
13.94MB 1.37% 94.99% 13.94MB 1.37% github.com/klauspost/compress/s2.NewWriter.func1
4.10MB 0.4% 95.39% 687.45MB 67.46% github.com/thanos-io/thanos/pkg/block/indexheader.newFileBinaryReader
1.50MB 0.15% 95.54% 5.55MB 0.54% github.com/thanos-io/thanos/pkg/store.(*blockSeriesClient).nextBatch
1.50MB 0.15% 95.68% 35.02MB 3.44% github.com/thanos-io/thanos/pkg/store.populateChunk
Hi @elliesaber,
Unfortunately, setting querier.query-store-for-labels-enabled: false in v1.18.0 does not disable querying the store-gateway for labels since the flag was removed in #5984.
We could bring the flag back by reverting #5984. I'm not really sure why we decided to remove this flag instead of defaulting it to true. Adding the flag back could help users who want to upgrade to 1.18.0 without querying the store-gateway for labels.
Thank you @CharlieTLe for the suggestion.
I agree that being able to set querier.query-store-for-labels-enabled manually, instead of relying on the default behavior, would be helpful for us. Reverting the flag and letting users control whether to query the store-gateway for labels would give us more flexibility, likely prevent the CPU and memory spikes that are leading to OOMKills, and let us upgrade to v1.18.0 without running into these memory issues.
I don't think the heap dump above shows that the issue was label values requests touching the store-gateway. The heap dump was probably not taken at the right time, as your memory usage showed that it could go up to 48GB.
For the memory usage metric, are you using container_memory_working_set_bytes or the heap size metric?
Another thing that might help with the issue is setting GOMEMLIMIT. But we need to understand the root cause of the OOM kill first.
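For example (a sketch only; whether the chart exposes an env list for the store-gateway, and the right value, depend on your setup), GOMEMLIMIT could be set a bit below the container memory limit (24Gi in the values above):
store_gateway:
  env:
    # Soft limit for the Go runtime, kept below the container limit so the GC
    # becomes more aggressive before the kernel OOM-kills the pod.
    - name: GOMEMLIMIT
      value: "21GiB"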
This message seems pretty telling that it is caused by the behavior controlled by the flag querier.query-store-for-labels-enabled.
We first tried setting querier.query-store-for-labels-enabled: true in version 1.17.1. After making this change, we observed that the Store Gateway Pods started frequently encountering OOMKills, with both CPU and memory usage spiking far beyond their usual levels.
If we ignore the heap dump, it does seem possible that there is a label with very high cardinality. If there is no limit on how many label values can be queried, I could imagine the store-gateway being overwhelmed by fetching every possible value for a label.
@yeya24 we used container_memory_working_set_bytes in the graph screenshots above.
Thanks, and sorry for the late response. @elliesaber, what does the go_memstats_alloc_bytes metric look like? That is your heap size.
If you can confirm that the OOM kills were caused by the query-store-for-labels-enabled change, I think we can add the flag back, since this breaks the user experience.
Hey @yeya24, we believe it's related to that flag. This is what go_memstats_alloc_bytes looked like for the different store-gateways. Let me know if the image below helps, or if you need more info or anything clearer!
@dpericaxon I don't think the graph shows that the flag is related; it looks more like it is related to a rollout.
Do you have any API requests asking for label names/values at the time of the spikes? The flag only affects those labels APIs, so we need evidence that those API calls caused the memory increase. You can reproduce this by calling the API manually yourself.
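For example, something like the following with an explicit time range, run around the time of a spike (host, port, and tenant header are placeholders for your setup; the header can be dropped if multi-tenancy is disabled):
# Label names over an explicit time range
curl -s -H 'X-Scope-OrgID: tenant-1' \
  'http://cortex-query-frontend:8080/prometheus/api/v1/labels?start=2024-11-20T00:00:00Z&end=2024-11-20T12:00:00Z'
# Values for one label over the same range
curl -s -H 'X-Scope-OrgID: tenant-1' \
  'http://cortex-query-frontend:8080/prometheus/api/v1/label/__name__/values?start=2024-11-20T00:00:00Z&end=2024-11-20T12:00:00Z'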
Hey @yeya24,
I observed the issue immediately after updating Cortex to v1.18.0. Specifically, as soon as the first store-gateway instance rolls out during the upgrade, the store-gateways begin experiencing OOMKills.
Key Observations:
- Initial store-gateway rollout: here's a screenshot showing that only cortex-store-gateway-5 had been updated so far. You can see that non-updated store-gateways experience OOMKills as well.
- Errors on non-updated store-gateways: the non-updated store-gateways displayed repeated errors related to reading from Memcached, even though Memcached itself hadn't been updated or changed during the upgrade. For example:
  2024-11-20T10:40:02.446-05:00 ts=2024-11-20T15:40:02.446165605Z caller=grpc_logging.go:74 level=warn method=/gatewaypb.StoreGateway/Series duration=1.187062844s err="rpc error: code = Unknown desc = send series response: rpc error: code = Canceled desc = context canceled" msg=gRPC
  2024-11-20T10:40:17.375-05:00 ts=2024-11-20T15:40:17.374862465Z caller=memcached_client.go:438 level=warn name=index-cache msg="failed to fetch items from memcached" numKeys=139 firstKey=S:0redact:100543056 err="memcache: connect timeout to redact:11211"
  2024-11-20T10:40:44.272-05:00 ts=2024-11-20T15:40:44.272682984Z caller=memcached_client.go:438 level=warn name=chunks-cache msg="failed to fetch items from memcached" numKeys=2 firstKey=subrange:fake/0redact/chunks/000146:336960000:336976000 err="failed to wait for turn. Instance: : context canceled"
  The most frequent errors were "failed to fetch items from memcached" and rpc error: code = Unknown desc = send series response.
- Query-frontend API call and response: I tested against a single query-frontend using curl -X GET 'http://localhost:62308/prometheus/api/v1/labels'. The first run returned:
  {"status":"error","errorType":"execution","error":"consistency check failed because some blocks were not queried:<blocks>"}
  This occurred either right before or simultaneously with the OOMKills on the store-gateways. A second run of the same curl against the query-frontend did not trigger the OOMKills. It seems like whenever this issue occurs there is a spike in "failed to fetch items from memcached" error messages on the store-gateways.
- Store-gateway logs: during this period, logs on the store-gateway instances repeatedly showed "failed to fetch items from memcached" errors.
- Recovery after OOMKills: after the OOMKills, all store-gateways eventually recovered, but several minutes later the same thing happened again.
Memory Spikes and Metrics:
- Spikes in memory usage: here's a screenshot showing the memory spikes in the store-gateways starting at 10:40 AM EST, coinciding with the upgrade.
- Go memory stats: the go_memstats_alloc_bytes metric also shows a significant spike after the upgrade.
Questions:
- Could this issue be related to the API usage? Am I querying the correct API?
- Is there a chance some larger, subsequent request triggered by this API call is causing the memory spike?
- Could there be an underlying change in v1.18.0 affecting Memcached interactions, or perhaps a different factor causing the memory pressure?
Let me know if you need further details or additional logs to debug this issue.
Hey @yeya24, was the above helpful? Is there any other information you need?
Hey @yeya24, my apologies for tagging you again, but have you had a chance to look at the above (when you get a chance, please)?
Sorry for the late reply. I think this issue is not related to the labels API changes, since you didn't see OOM kills triggered by those calls.
From what you described, is the OOM kill related to the rollout only? If you give the store-gateways enough memory, or increase the number of replicas so that they don't get OOM killed, do you still see periodic memory usage spikes? Do you see corresponding CPU usage spikes as well?
It would be good to understand what the store-gateways are doing at that time. Can you please take some CPU profiles if you do see CPU usage spikes?
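For example, something like this against a store-gateway while the CPU is spiking (pod name, namespace, and duration are placeholders):
kubectl -n cortex port-forward pod/cortex-store-gateway-0 8080:8080 &
# 60-second CPU profile captured during a spike
curl -s 'http://localhost:8080/debug/pprof/profile?seconds=60' > cpu.out
go tool pprof cpu.out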
Hey @yeya24,
We were able to upgrade to v1.19.0 and get past this by (a values-file sketch of these changes follows below):
- Reducing blocks-storage.bucket-store.index-cache.memcached.max-async-buffer-size from 10000000 to 10000
- Scaling up the number of Memcached pods
- Setting blocks-storage.bucket-store.index-cache.memcached.max-idle-connections to 100 for both the index-cache and chunks-cache
- Increasing blocks-storage.bucket-store.index-cache.memcached.max-async-concurrency to 15
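In values-file form, those store-gateway changes looked roughly like this (a sketch; the chunks-cache idle-connections setting uses the corresponding blocks-storage.bucket-store.chunks-cache.memcached flag):
store_gateway:
  extraArgs:
    # Smaller async buffer and more Memcached connections/concurrency
    blocks-storage.bucket-store.index-cache.memcached.max-async-buffer-size: "10000"
    blocks-storage.bucket-store.index-cache.memcached.max-idle-connections: "100"
    blocks-storage.bucket-store.chunks-cache.memcached.max-idle-connections: "100"
    blocks-storage.bucket-store.index-cache.memcached.max-async-concurrency: "15"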
That got us past the upgrade, but we've since noticed the following on the store-gateways:
- Frequent CPU spikes
- Occasional memory spikes
Do you have any recommendations for tuning store-gateway settings to reduce these spikes, or is this expected behavior in v1.19.0?
Below is a 24-hour CPU/Memory usage chart. We’re running 5 store-gateway pods, each with:
- Requests: 6 vCPU / 50 GB memory
- Limits: 12 vCPU / 60 GB memory