cloud-provider-aws
"cloud-controller-manager" continuously OOMKilled
Hi team,
The control plane component "Cloud Controller Manager" in one cluster is continuously failing to start because its pod "cloud-controller-manager-*" keeps being OOMKilled. The details are shown below:
- After the maintenance window of the shoot cluster, the control plane components restarted; the cloud-controller-manager pod was then OOMKilled and could not be started again.
- Checked the memory of the cloud-controller-manager deployment: it had reached its memory limit of 2.6Gi, and the pod is in CrashLoopBackOff status (one way to confirm this programmatically is sketched after this list).
- Restarted the cloud-controller-manager deployment, but the OOMKilled issue happened again without any suspicious error being found. Sometimes the logs stop at the "EnsuredLoadBalancer" step, as shown:
  {"log":"\"Event occurred\" object=\"XXXX\" kind=\"Service\" apiVersion=\"v1\" type=\"Normal\" reason=\"EnsuredLoadBalancer\" message=\"Ensured load balancer\"","pid":"1","severity":"INFO","source":"event.go:294"}
- This cluster has more than 110 nodes; by comparison, another cluster with more than 300 nodes does not have the OOMKilled issue, and its startup logs are very similar.
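For anyone who wants to reproduce the check above, here is a minimal client-go sketch (not part of the original report); the kube-system namespace and the app=cloud-controller-manager label selector are assumptions that may differ per setup:

```go
// Sketch: list cloud-controller-manager pods and report any container
// whose last termination reason was OOMKilled. The namespace and label
// selector are assumptions; adjust them for your cluster.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load ~/.kube/config; in-cluster config would also work.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	pods, err := cs.CoreV1().Pods("kube-system").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "app=cloud-controller-manager"})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		for _, st := range p.Status.ContainerStatuses {
			// LastTerminationState records why the previous instance of
			// the container died, e.g. "OOMKilled".
			if t := st.LastTerminationState.Terminated; t != nil && t.Reason == "OOMKilled" {
				fmt.Printf("%s/%s: restarts=%d, last termination=%s (exit %d)\n",
					p.Name, st.Name, st.RestartCount, t.Reason, t.ExitCode)
			}
		}
	}
}
```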
The cluster version information:
- K8s: 1.23.4
- cloud-provider-aws: 1.24.0
Any ideas or clues from your side? If you need any more information, just let me know. Thank you.
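As a temporary experiment (a hedged sketch of my own, not something proposed in this thread), raising the memory limit can at least show whether usage plateaus or grows without bound; the kube-system namespace, the deployment and container names, and the 4Gi figure below are all assumptions:

```go
// Hypothetical mitigation sketch: bump the memory limit on the
// cloud-controller-manager Deployment with a strategic-merge patch so
// one can observe whether memory usage levels off or keeps climbing.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// Assumed names: adjust deployment/container/namespace to your setup.
	patch := []byte(`{"spec":{"template":{"spec":{"containers":[` +
		`{"name":"cloud-controller-manager","resources":{"limits":{"memory":"4Gi"}}}]}}}`)
	if _, err := cs.AppsV1().Deployments("kube-system").Patch(context.TODO(),
		"cloud-controller-manager", types.StrategicMergePatchType,
		patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}
}
```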
/triage support
@cathyzhang05: The label(s) triage/support cannot be applied, because the repository doesn't have them.
@cathyzhang05: This issue is currently awaiting triage.
If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
I am working with @cathyzhang05 and got this profile from one of the OOM runs: profile.pb.gz. I hope this provides some extra data. To me it seems that the AWS library has issues dealing efficiently with large API responses, but I could be wrong. Have you seen that happen in the past?
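If the large-response hypothesis holds, the usual fix on the SDK side is page-at-a-time processing. Below is a hedged illustration only, using the aws-sdk-go v1 paginator (cloud-provider-aws 1.24 uses aws-sdk-go v1, though whether this specific call pattern is the culprit is just my guess, not the project's confirmed code path):

```go
// Sketch of bounded-memory handling of a large DescribeInstances
// response with aws-sdk-go v1: each page is processed in the callback
// and can be garbage-collected before the next page arrives, instead
// of accumulating every reservation in one big slice.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession())
	client := ec2.New(sess)

	count := 0
	err := client.DescribeInstancesPages(
		&ec2.DescribeInstancesInput{MaxResults: aws.Int64(1000)},
		func(page *ec2.DescribeInstancesOutput, lastPage bool) bool {
			for _, r := range page.Reservations {
				count += len(r.Instances) // process, then drop the page
			}
			return true // continue to the next page
		})
	if err != nil {
		panic(err)
	}
	fmt.Println("instances seen:", count)
}
```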
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".