
"cloud-controller-manager" continuously OOMKilled

Open cathyzhang05 opened this issue 1 year ago • 3 comments

Hi team,

The control plane component "Cloud Controller Manager" in one of our clusters keeps failing to come back up because the pod "cloud-controller-manager-*" is OOMKilled. Details below:

  • After the maintenance window of the shoot cluster, the control plane components restarted; the cloud-controller-manager pod was then OOMKilled and could not start again.

  • Checked the memory usage of the cloud-controller-manager deployment: it reached the memory limit of 2.6Gi. The cloud-controller-manager pod is in CrashLoopBackOff status; see the sketch after this list for one way to confirm the termination reason. (Screenshot attached: Screen Shot 2022-08-16 at 15:23:56.)

  • Restarted the cloud-controller-manager deployment, but the pod was OOMKilled again without any suspicious error in the logs. Sometimes the logs stop at the "EnsuredLoadBalancer" step, as shown:

{"log":"\"Event occurred\" object=\"XXXX" kind=\"Service\" apiVersion=\"v1\" type=\"Normal\" reason=\"EnsuredLoadBalancer\" message=\"Ensured load balancer\"","pid":"1","severity":"INFO","source":"event.go:294"}
  • This cluster has more than 110 nodes. I compared it with another cluster of more than 300 nodes that does not have the OOMKilled issue; its logs look very similar when the pod starts up.
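
For completeness, a minimal client-go sketch of how the OOMKilled termination reason and restart count described above can be confirmed programmatically. The namespace and label selector are assumptions and may differ in a Gardener setup, where the CCM typically runs in the shoot's control plane namespace on the seed:

```go
// oomcheck.go - print restart counts and last termination reasons for the
// cloud-controller-manager pods. Namespace and label selector are assumptions.
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Assumed namespace and selector; adjust to wherever your CCM runs.
	pods, err := client.CoreV1().Pods("kube-system").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "app=cloud-controller-manager"})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			reason := ""
			if cs.LastTerminationState.Terminated != nil {
				reason = cs.LastTerminationState.Terminated.Reason // e.g. "OOMKilled"
			}
			fmt.Printf("%s restarts=%d lastTermination=%q\n", pod.Name, cs.RestartCount, reason)
		}
	}
}
```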

The cluster version information:

  • K8s: 1.23.4
  • cloud-provider-aws: 1.24.0

Any ideas or clues from your side? If you need more information, just let me know. Thank you.
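
One way to gather more data on where the memory goes is a heap profile from the controller manager's pprof endpoint (served when the component runs with --profiling enabled). Below is a minimal sketch, assuming the endpoint has already been made reachable locally, e.g. via a port-forward; the address is a placeholder:

```go
// heapdump.go - fetch a heap profile from the controller manager's pprof
// endpoint and write it to disk. The URL is a placeholder; replace it with
// however you expose the endpoint (port-forward, local proxy, etc.).
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Placeholder local address; adjust host/port/scheme to your deployment.
	const url = "http://127.0.0.1:8080/debug/pprof/heap"

	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		panic(fmt.Sprintf("unexpected status: %s", resp.Status))
	}

	out, err := os.Create("heap.pb.gz")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	n, err := io.Copy(out, resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("wrote %d bytes to heap.pb.gz\n", n)
}
```

The resulting file can then be inspected with `go tool pprof -top heap.pb.gz` to see which allocation sites dominate.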

/triage support

cathyzhang05 avatar Aug 16 '22 08:08 cathyzhang05

@cathyzhang05: The label(s) triage/support cannot be applied, because the repository doesn't have them.

In response to this:

/triage support

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Aug 16 '22 08:08 k8s-ci-robot

@cathyzhang05: This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Aug 16 '22 08:08 k8s-ci-robot

I am working with @cathyzhang05 and got this profile from one of the OOM runs: profile.pb.gz. I hope this provides some extra data. To me it seems that the AWS library has issues dealing efficiently with large API responses, but I could be wrong. Have you seen this happen in the past?

kon-angelo avatar Aug 22 '22 06:08 kon-angelo
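
To make the hypothesis above concrete: large clusters mean large EC2/ELB list responses, and if those responses are not consumed page by page, memory use grows with cluster size. A minimal aws-sdk-go sketch (an illustration, not the provider's actual code) of paginating DescribeInstances so each page can be released after processing; the region and page size are assumptions:

```go
// listinstances.go - paginate DescribeInstances so each page can be processed
// and garbage-collected instead of holding the full response set in memory.
// Region and page size are illustrative assumptions.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{
		Region: aws.String("eu-central-1"), // assumed region
	}))
	svc := ec2.New(sess)

	input := &ec2.DescribeInstancesInput{
		MaxResults: aws.Int64(100), // smaller pages keep each response bounded
	}

	total := 0
	err := svc.DescribeInstancesPages(input,
		func(page *ec2.DescribeInstancesOutput, lastPage bool) bool {
			for _, res := range page.Reservations {
				total += len(res.Instances)
			}
			// Returning true requests the next page; the previous page is no
			// longer referenced once this callback returns.
			return true
		})
	if err != nil {
		panic(err)
	}
	fmt.Printf("described %d instances\n", total)
}
```

Comparing a pattern like this against the heap profile should show whether allocations concentrate in the SDK's response handling or somewhere else in the controller.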

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Nov 20 '22 07:11 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Dec 20 '22 07:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Jan 19 '23 08:01 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jan 19 '23 08:01 k8s-ci-robot