cloud-provider-aws
Invalid format for AWS instance
What happened:
We've upgraded to Kubernetes 1.25.15 and installed the AWS cloud controller manager (CCM) in our cluster using kOps. Since the upgrade we periodically see these error logs from the AWS CCM:
E1102 13:36:37.503965 1 node_lifecycle_controller.go:185] error checking if node <instance-id> is shutdown: Invalid format for AWS instance ()
We see this when a node comes up and joins the cluster. The node becomes Ready, and it doesn't appear to affect scheduling, but we're still not sure why we see these errors.
What you expected to happen:
Not seeing errors on those nodes, or a more detailed error log.
How to reproduce it (as minimally and precisely as possible):
Running cloud-controller-manager v1.25.12 on Kubernetes 1.25.15.
Environment:
- Kubernetes version (use kubectl version): v1.25.15
- Cloud provider or hardware configuration: AWS EC2
- OS (e.g. from /etc/os-release): Ubuntu 20.04.6
- Kernel (e.g. uname -a): 5.15.0-1037-aws
- Install tools: kOps
- Others:
/kind bug
This issue is currently awaiting triage.
If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What does the node's provider ID look like?
aws:///eu-west-1a/
That's definitely the problem. Do you know what's setting the provider ID on your nodes?
We're running the cluster with kOps, so from the docs it seems the Node controller inside kops-controller is what applies this to the Kubernetes node: https://kops.sigs.k8s.io/architecture/kops-controller/. It also contains the instance ID, so it looks like aws:///eu-west-1a/<instance-id>. Anyway, what should it be to be valid?
The log message you included looks like the provider ID wasn't defined at all (the name is blank): https://github.com/kubernetes/cloud-provider-aws/blob/a1eb96d8ee3baffa8450e870c7360afa6ca836d2/pkg/providers/v1/instances.go#L82
I don't know how the provider ID is set in a kops cluster.
/kind support
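For reference, the provider ID the AWS provider expects has the form aws:///<availability-zone>/<instance-id>. Below is a simplified, self-contained sketch of the parsing the linked instances.go performs (not the verbatim CCM source); it shows why a blank providerID produces exactly the "Invalid format for AWS instance ()" message from the log above:

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

// EC2 instance IDs look like i-12345678 or i-0123456789abcdef0.
var awsInstanceRegMatch = regexp.MustCompile(`^i-[^/]*$`)

// mapToAWSInstanceID is a simplified version of the check in
// pkg/providers/v1/instances.go: it extracts the EC2 instance ID from a
// providerID such as aws:///eu-west-1a/i-0123456789abcdef0.
func mapToAWSInstanceID(name string) (string, error) {
	s := name
	if !strings.HasPrefix(s, "aws://") {
		// A blank providerID falls through here and becomes a URL
		// with empty host and path segments.
		s = "aws://" + "/" + "/" + s
	}
	u, err := url.Parse(s)
	if err != nil {
		return "", fmt.Errorf("Invalid instance name (%s): %v", name, err)
	}
	if u.Scheme != "aws" {
		return "", fmt.Errorf("Invalid scheme for AWS instance (%s)", name)
	}
	awsID := ""
	tokens := strings.Split(strings.Trim(u.Path, "/"), "/")
	switch len(tokens) {
	case 1: // <instance-id>
		awsID = tokens[0]
	case 2: // <availability-zone>/<instance-id>
		awsID = tokens[1]
	}
	if awsID == "" || !awsInstanceRegMatch.MatchString(awsID) {
		// A blank name ends up here, which is why the log shows
		// "Invalid format for AWS instance ()".
		return "", fmt.Errorf("Invalid format for AWS instance (%s)", name)
	}
	return awsID, nil
}

func main() {
	for _, id := range []string{
		"aws:///eu-west-1a/i-0123456789abcdef0", // valid
		"aws:///eu-west-1a/",                    // AZ but no instance ID, as reported above
		"",                                      // providerID not yet populated on the node
	} {
		instance, err := mapToAWSInstanceID(id)
		fmt.Printf("%q => %q, err: %v\n", id, instance, err)
	}
}
```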
OK, this helped me understand the issue. providerID is a field added by kops-controller to every node that joins the cluster. For each instance that joined, we saw the error log a moment before it joined, so the node lifecycle controller is probably checking the node before the providerID has been added. After kops-controller adds the providerID to the node, the errors disappear. Maybe adding retries to the node lifecycle controller that wait for the providerID, for environments managed by kOps, could solve the issue (see the sketch below).
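For anyone debugging this, here is a minimal client-go sketch of the "wait for the providerID" idea; the kubeconfig path and node name are placeholders, and this illustrates the race rather than being a proposed patch to the node lifecycle controller:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path; adjust for your cluster.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	nodeName := "ip-10-0-0-1.eu-west-1.compute.internal" // hypothetical node

	// Poll every 5s for up to 2m until kops-controller (or the kubelet)
	// has populated spec.providerID on the node.
	err = wait.PollUntilContextTimeout(context.Background(), 5*time.Second, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
			if err != nil {
				return false, err
			}
			if node.Spec.ProviderID == "" {
				// This is exactly the window in which the CCM logs
				// "Invalid format for AWS instance ()".
				fmt.Println("providerID not set yet")
				return false, nil
			}
			fmt.Println("providerID:", node.Spec.ProviderID)
			return true, nil
		})
	if err != nil {
		panic(err)
	}
}
```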
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
/close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@cartermckinnon is it possible to /reopen this issue, please? We are seeing the same error messages in our clusters deployed with kubeadm, and it seems related to the explanation above (a race condition between the tool that sets the providerID on the node, kops or the kubelet, and the node controller).
@yogeek do you mind opening a fresh issue with exact details from your clusters? (Then cross-link this issue from there.)