
"cloud-controller-manager" continuously OOMKilled

Open cathyzhang05 opened this issue 1 year ago • 3 comments

Hi team,

The control plane component "Cloud Controller Manager" in one of our clusters keeps failing to come back up because the pod "cloud-controller-manager-*" is OOMKilled. Details below:

  • After the maintenance window of the shoot cluster, the control plane components restarted; the cloud-controller-manager pod was then OOMKilled and could not start again.

  • Checked the memory usage of the cloud-controller-manager deployment: it reached the memory limit of 2.6Gi. The cloud-controller-manager pod is in CrashLoopBackOff status; see the sketch after this list for one way to confirm the termination reason. (Screenshot attached: Screen Shot 2022-08-16 at 15:23:56.)

  • Restarted the cloud-controller-manager deployment, but the pod was OOMKilled again without any suspicious error in the logs. Sometimes the logs stop at the "EnsuredLoadBalancer" step, as shown:

{"log":"\"Event occurred\" object=\"XXXX" kind=\"Service\" apiVersion=\"v1\" type=\"Normal\" reason=\"EnsuredLoadBalancer\" message=\"Ensured load balancer\"","pid":"1","severity":"INFO","source":"event.go:294"}
  • This cluster has more than 110 nodes. I compared it with another cluster of more than 300 nodes that does not have the OOMKilled issue; its logs look very similar when the pod starts up.
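
For completeness, a minimal client-go sketch of how the OOMKilled termination reason and restart count described above can be confirmed programmatically. The namespace and label selector are assumptions and may differ in a Gardener setup, where the CCM typically runs in the shoot's control plane namespace on the seed:

```go
// oomcheck.go - print restart counts and last termination reasons for the
// cloud-controller-manager pods. Namespace and label selector are assumptions.
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Assumed namespace and selector; adjust to wherever your CCM runs.
	pods, err := client.CoreV1().Pods("kube-system").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "app=cloud-controller-manager"})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			reason := ""
			if cs.LastTerminationState.Terminated != nil {
				reason = cs.LastTerminationState.Terminated.Reason // e.g. "OOMKilled"
			}
			fmt.Printf("%s restarts=%d lastTermination=%q\n", pod.Name, cs.RestartCount, reason)
		}
	}
}
```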

The cluster version information:

  • K8s: 1.23.4
  • cloud-provider-aws: 1.24.0

Any ideas or clues from your side? If you need more information, just let me know. Thank you.
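
One way to gather more data on where the memory goes is a heap profile from the controller manager's pprof endpoint (served when the component runs with --profiling enabled). Below is a minimal sketch, assuming the endpoint has already been made reachable locally, e.g. via a port-forward; the address is a placeholder:

```go
// heapdump.go - fetch a heap profile from the controller manager's pprof
// endpoint and write it to disk. The URL is a placeholder; replace it with
// however you expose the endpoint (port-forward, local proxy, etc.).
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Placeholder local address; adjust host/port/scheme to your deployment.
	const url = "http://127.0.0.1:8080/debug/pprof/heap"

	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		panic(fmt.Sprintf("unexpected status: %s", resp.Status))
	}

	out, err := os.Create("heap.pb.gz")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	n, err := io.Copy(out, resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("wrote %d bytes to heap.pb.gz\n", n)
}
```

The resulting file can then be inspected with `go tool pprof -top heap.pb.gz` to see which allocation sites dominate.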

/triage support

cathyzhang05 avatar Aug 16 '22 08:08 cathyzhang05

@cathyzhang05: The label(s) triage/support cannot be applied, because the repository doesn't have them.

In response to this:

/triage support

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Aug 16 '22 08:08 k8s-ci-robot

@cathyzhang05: This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Aug 16 '22 08:08 k8s-ci-robot

I am working with @cathyzhang05 and got this profile from one of the OOM runs: profile.pb.gz. I hope this provides some extra data. To me it seems that the AWS library has issues dealing efficiently with large API responses, but I could be wrong. Have you seen this happen in the past?

kon-angelo avatar Aug 22 '22 06:08 kon-angelo
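
To make the hypothesis above concrete: large clusters mean large EC2/ELB list responses, and if those responses are not consumed page by page, memory use grows with cluster size. A minimal aws-sdk-go sketch (an illustration, not the provider's actual code) of paginating DescribeInstances so each page can be released after processing; the region and page size are assumptions:

```go
// listinstances.go - paginate DescribeInstances so each page can be processed
// and garbage-collected instead of holding the full response set in memory.
// Region and page size are illustrative assumptions.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{
		Region: aws.String("eu-central-1"), // assumed region
	}))
	svc := ec2.New(sess)

	input := &ec2.DescribeInstancesInput{
		MaxResults: aws.Int64(100), // smaller pages keep each response bounded
	}

	total := 0
	err := svc.DescribeInstancesPages(input,
		func(page *ec2.DescribeInstancesOutput, lastPage bool) bool {
			for _, res := range page.Reservations {
				total += len(res.Instances)
			}
			// Returning true requests the next page; the previous page is no
			// longer referenced once this callback returns.
			return true
		})
	if err != nil {
		panic(err)
	}
	fmt.Printf("described %d instances\n", total)
}
```

Comparing a pattern like this against the heap profile should show whether allocations concentrate in the SDK's response handling or somewhere else in the controller.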

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Nov 20 '22 07:11 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Dec 20 '22 07:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Jan 19 '23 08:01 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jan 19 '23 08:01 k8s-ci-robot