
Kube-state-metrics 20x spike in memory usage at restart

Open zhoujoetan opened this issue 1 year ago • 2 comments

What happened: A few of our kube-state-metrics instances (single-instance, no sharding) recently had OOM issues after a restart. Memory usage spiked to 2.5 GB (see attachment) for a few minutes before stabilizing at 131 MB. We tried increasing the CPU limit from the default 0.1 to 1, or even 5, but it did not seem to help much.

Here is the pprof profile I captured:

File: kube-state-metrics
Type: inuse_space
Time: Jan 11, 2024 at 4:12pm (PST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 1503.72MB, 99.10% of 1517.33MB total
Dropped 50 nodes (cum <= 7.59MB)
Showing top 10 nodes out of 29
      flat  flat%   sum%        cum   cum%
  753.11MB 49.63% 49.63%   753.11MB 49.63%  io.ReadAll
  748.12MB 49.30% 98.94%   748.12MB 49.30%  k8s.io/apimachinery/pkg/runtime.(*Unknown).Unmarshal
    2.49MB  0.16% 99.10%     8.99MB  0.59%  k8s.io/kube-state-metrics/v2/pkg/metrics_store.(*MetricsStore).Add
         0     0% 99.10%   753.11MB 49.63%  io/ioutil.ReadAll (inline)
         0     0% 99.10%   749.71MB 49.41%  k8s.io/apimachinery/pkg/runtime.WithoutVersionDecoder.Decode
         0     0% 99.10%   749.71MB 49.41%  k8s.io/apimachinery/pkg/runtime/serializer/protobuf.(*Serializer).Decode
         0     0% 99.10%     8.99MB  0.59%  k8s.io/apimachinery/pkg/util/wait.BackoffUntil
         0     0% 99.10%     8.99MB  0.59%  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
         0     0% 99.10%  1502.32MB 99.01%  k8s.io/client-go/kubernetes/typed/core/v1.(*configMaps).List
         0     0% 99.10%   753.61MB 49.67%  k8s.io/client-go/rest.(*Request).Do

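A heap profile like this can be pulled with something along these lines, assuming the pod exposes Go's standard /debug/pprof endpoints on the telemetry port (the namespace, port, and endpoint availability are assumptions that depend on your deployment and KSM version):

  # Forward the KSM telemetry port to localhost (namespace and port are assumptions).
  kubectl -n kube-system port-forward deploy/kube-state-metrics 8081:8081
  # Pull a heap profile and open the interactive pprof shell, then run "top".
  go tool pprof -inuse_space http://localhost:8081/debug/pprof/heap
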
It looks like heap memory usage does not account for 100% of the container_memory_usage_bytes metric.

What you expected to happen: memory usage to not spike 20x at restart

How to reproduce it (as minimally and precisely as possible): Kill/restart the KSM pod.
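For example (label selector and namespace are assumptions; adjust for your cluster):

  kubectl -n kube-system delete pod -l app.kubernetes.io/name=kube-state-metrics
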

Anything else we need to know?:

Environment:

  • kube-state-metrics version: v2.3.0
  • Kubernetes version (use kubectl version): v1.24.17-eks-8cb36c9
  • Cloud provider or hardware configuration: AWS EKS
  • Other info:

[attachment: memory usage graph]

zhoujoetan avatar Jan 12 '24 19:01 zhoujoetan

/triage accepted
/assign @rexagod

dgrisonnet avatar Jan 25 '24 17:01 dgrisonnet

@zhoujoetan try excluding configmaps and secrets from the list of exported resources (the --resources= command-line option). At least for me, this dropped initial memory usage from ~400 MiB to ~24 MiB.

Both the CLI and the Helm chart include them by default.
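
For example, roughly (the resource list below is illustrative, not the full default set; enumerate whatever your cluster actually needs and simply leave out configmaps and secrets):

  kube-state-metrics \
    --resources=pods,deployments,daemonsets,statefulsets,replicasets,nodes,services,namespaces,jobs,cronjobs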

In my case, Helm release history (Helm stores release manifests in Secrets by default) was the main culprit for the high memory usage. I could not confirm whether pagination is used for the initial List, which might mitigate this issue.

Hope this helps.

mindw avatar Feb 06 '24 13:02 mindw

kube-state-metrics version: v2.3.0

@zhoujoetan It seems you're on an outdated version that's no longer supported. Could you switch to one of the supported versions (preferably the latest release) and verify this issue still persists for you?

rexagod avatar Feb 25 '24 20:02 rexagod

I have figured out the issue. We had a ton of ConfigMap objects that KSM reads at startup. Trimming those objects brought memory usage back down. I am closing the issue now.
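
For anyone hitting the same thing, a couple of illustrative one-liners to gauge how much ConfigMap data KSM has to List at startup (rough approximations only):

  # Count ConfigMaps across all namespaces.
  kubectl get configmaps --all-namespaces --no-headers | wc -l
  # Rough total size of all ConfigMap data, in bytes.
  kubectl get configmaps --all-namespaces -o json | wc -c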

zhoujoetan avatar Feb 27 '24 19:02 zhoujoetan

@zhoujoetan When you say ConfigMap objects, do you mean cluster-wide, or was it something specific? Also, when you say trimming, was it deleting unwanted ConfigMaps, or removing data from them?

nalshamaajc avatar Mar 13 '24 22:03 nalshamaajc