
Publishing failed pod schedule events can lead to etcd overflow

Open bacek opened this issue 9 months ago • 8 comments

We have multiple NodePools in our system for better resource isolation and differing requirements, and we do processing with a large number of Kubernetes Jobs. Scheduling multiple Pods triggers scale-up of different NodePools. Karpenter iterates over all existing NodePools to find the correct one, but for each incompatible NodePool it emits an error message:

https://github.com/kubernetes-sigs/karpenter/blob/7bf31e553f390111058d16b6cd5745ed144d3de8/pkg/controllers/provisioning/scheduling/scheduler.go#L405

This message is in turn emitted as a k8s Event in:

https://github.com/kubernetes-sigs/karpenter/blob/7bf31e553f390111058d16b6cd5745ed144d3de8/pkg/events/recorder.go#L86

With multiple NodePools, each message will be on the order of 10-20 kilobytes, and with ~10k pods this will overflow the etcd database.
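
To make the growth concrete, here is a minimal, hypothetical Go sketch (buildSchedulingError and the reason format are illustrative stand-ins, not Karpenter's actual code) showing how a single event message that concatenates one reason per incompatible NodePool grows linearly with the NodePool count:

package main

import (
	"fmt"
	"strings"
)

// buildSchedulingError is a hypothetical stand-in for the message assembled
// when a pod fits no NodePool: one reason string per incompatible NodePool,
// all joined into a single error message.
func buildSchedulingError(nodePools []string) string {
	reasons := make([]string, 0, len(nodePools))
	for _, np := range nodePools {
		reasons = append(reasons, fmt.Sprintf(
			"incompatible with nodepool %q, no instance type satisfied resources and requirements {...}", np))
	}
	return "Failed to schedule pod, " + strings.Join(reasons, "; ")
}

func main() {
	pools := make([]string, 40)
	for i := range pools {
		pools[i] = fmt.Sprintf("team-%02d-pool", i)
	}
	msg := buildSchedulingError(pools)
	// Each NodePool adds a fixed-size chunk, so the message size is linear
	// in the NodePool count; with real requirement dumps the per-pool chunks
	// are far larger than this sketch's, reaching the tens of KB described above.
	fmt.Printf("nodepools=%d message_bytes=%d\n", len(pools), len(msg))
}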

bacek avatar Mar 13 '25 03:03 bacek

Correction: in our case events were ~76KB.

bacek avatar Mar 13 '25 03:03 bacek

And having ~10k pods will overflow etcd database

How does 76KB overflow the DB? That seems like a pretty small number to me. Also, containerd and kube-scheduler generally emit more events than Karpenter does, so I'm curious whether you are seeing the same issue with those components.

jonathan-innis avatar Mar 13 '25 16:03 jonathan-innis

And having ~10k pods will overflow etcd database

How does 76KB overflow the DB? That seems like a pretty small number to me. Also, containerd and kube-scheduler generally emit more events than Karpenter does, so I'm curious whether you are seeing the same issue with those components.

We don't have enough compute to provision for all pods, so we have some long-running jobs waiting while we utilize all available nodes.

At the time of the etcd overflow, the state looked like this:

kubectl get --raw=/metrics | grep apiserver_storage_objects | awk '$2>100' | sort -g -k 2
# HELP apiserver_storage_objects [STABLE] Number of stored objects at the time of last check split by kind. In case of a fetching error, the value will be -1.
# TYPE apiserver_storage_objects gauge
apiserver_storage_objects{resource="virtualservices.networking.istio.io"} 110
apiserver_storage_objects{resource="roles.rbac.authorization.k8s.io"} 112
apiserver_storage_objects{resource="deployments.apps"} 126
apiserver_storage_objects{resource="rolebindings.rbac.authorization.k8s.io"} 126
apiserver_storage_objects{resource="controllerrevisions.apps"} 130
apiserver_storage_objects{resource="clusterrolebindings.rbac.authorization.k8s.io"} 148
apiserver_storage_objects{resource="clusterroles.rbac.authorization.k8s.io"} 174
apiserver_storage_objects{resource="endpoints"} 188
apiserver_storage_objects{resource="endpointslices.discovery.k8s.io"} 202
apiserver_storage_objects{resource="serviceaccounts"} 205
apiserver_storage_objects{resource="services"} 229
apiserver_storage_objects{resource="configmaps"} 269
apiserver_storage_objects{resource="nodeclaims.karpenter.sh"} 278
apiserver_storage_objects{resource="cninodes.vpcresources.k8s.aws"} 280
apiserver_storage_objects{resource="csinodes.storage.k8s.io"} 280
apiserver_storage_objects{resource="nodes"} 280
apiserver_storage_objects{resource="leases.coordination.k8s.io"} 337
apiserver_storage_objects{resource="certificatesigningrequests.certificates.k8s.io"} 343
apiserver_storage_objects{resource="secrets"} 391
apiserver_storage_objects{resource="replicasets.apps"} 547
apiserver_storage_objects{resource="jobs.batch"} 1744
apiserver_storage_objects{resource="pods"} 4596
apiserver_storage_objects{resource="events"} 955190

So ~4.5k pods generated almost 1M events, which at ~76KB per event translates to roughly 76GB worth of events.
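
For anyone who wants to verify these numbers on their own cluster, here is a minimal client-go sketch (the kubeconfig path and the 500-item page size are assumptions, and JSON size only approximates what etcd stores) that pages through all events and sums their serialized sizes:

package main

import (
	"context"
	"encoding/json"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig; adjust for in-cluster use.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	var totalBytes, count int
	opts := metav1.ListOptions{Limit: 500} // page through events to avoid huge responses
	for {
		events, err := clientset.CoreV1().Events("").List(context.TODO(), opts)
		if err != nil {
			panic(err)
		}
		for _, e := range events.Items {
			b, _ := json.Marshal(e) // JSON size approximates storage size
			totalBytes += len(b)
			count++
		}
		if events.Continue == "" {
			break
		}
		opts.Continue = events.Continue
	}
	if count > 0 {
		fmt.Printf("events=%d total_bytes=%d avg_bytes=%d\n", count, totalBytes, totalBytes/count)
	}
}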

bacek avatar Mar 16 '25 23:03 bacek

/triage needs-investigation
/priority needs-more-evidence

jmdeal avatar Mar 17 '25 16:03 jmdeal

@jmdeal: The label(s) priority/needs-more-evidence cannot be applied, because the repository doesn't have them.

In response to this:

/triage needs-investigation
/priority needs-more-evidence

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Mar 17 '25 16:03 k8s-ci-robot

Is this on a self-managed cluster?

rschalo avatar Mar 26 '25 02:03 rschalo

Is this on a self-managed cluster?

No, it's EKS.

bacek avatar Mar 26 '25 02:03 bacek

Is there any news about this? Our solution has been to build a fork that limits the message size (and therefore makes the message useless). We keep reaching the etcd limit, and this Karpenter message is the main culprit.
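
For reference, a fork like that can be quite small. Here is a hedged sketch (the 1KB cap and the type names are hypothetical, and this wraps the generic client-go recorder rather than Karpenter's internal one) of a recorder that truncates messages before they reach the API server:

package events

import (
	"fmt"

	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

const maxMessageBytes = 1024 // hypothetical cap; tune for your cluster

// TruncatingRecorder wraps a record.EventRecorder and caps event message
// length. Note that AnnotatedEventf still falls through to the wrapped
// recorder uncapped; a complete fork would override it too.
type TruncatingRecorder struct {
	record.EventRecorder
}

func (r TruncatingRecorder) Event(object runtime.Object, eventtype, reason, message string) {
	r.EventRecorder.Event(object, eventtype, reason, truncate(message))
}

func (r TruncatingRecorder) Eventf(object runtime.Object, eventtype, reason, messageFmt string, args ...interface{}) {
	r.Event(object, eventtype, reason, fmt.Sprintf(messageFmt, args...))
}

func truncate(msg string) string {
	if len(msg) <= maxMessageBytes {
		return msg
	}
	return msg[:maxMessageBytes] + " ...(truncated)"
}

Wrapping the recorder this way keeps the change to a single file, at the cost of losing the truncated per-NodePool detail in the event.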

JonCholas avatar Apr 02 '25 23:04 JonCholas

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 02 '25 00:07 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Aug 01 '25 00:08 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Aug 31 '25 00:08 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Aug 31 '25 00:08 k8s-ci-robot