
Invalid format for AWS instance

Open doryer opened this issue 2 years ago • 9 comments

What happened:

We upgraded to Kubernetes 1.25.15 and installed the AWS CCM in our cluster using kOps. Since the upgrade, we periodically see error logs like the following from the AWS CCM:

E1102 13:36:37.503965 1 node_lifecycle_controller.go:185] error checking if node <instance-id> is shutdown: Invalid format for AWS instance () }

We see this when a node comes up and joins the cluster. The node becomes Ready and it does not appear to affect scheduling, but we are still not sure why we see these errors.

What you expected to happen:

Not seeing errors for those nodes, or at least a more detailed error message.

How to reproduce it (as minimally and precisely as possible):

Run the cloud controller manager v1.25.12 on Kubernetes 1.25.15.

Environment:

  • Kubernetes version (use kubectl version): v1.25.15
  • Cloud provider or hardware configuration: AWS EC2
  • OS (e.g. from /etc/os-release): Ubuntu 20.04.6
  • Kernel (e.g. uname -a): 5.15.0-1037-aws
  • Install tools: kOps
  • Others:

/kind bug

doryer avatar Nov 02 '23 13:11 doryer

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Nov 02 '23 13:11 k8s-ci-robot

What does the node's provider ID look like?

cartermckinnon avatar Nov 02 '23 17:11 cartermckinnon

What does the node's provider ID look like?

aws:///eu-west-1a/

doryer avatar Nov 05 '23 09:11 doryer

That's definitely the problem. Do you know what's setting the provider ID on your nodes?

cartermckinnon avatar Nov 06 '23 19:11 cartermckinnon

That's definitely the problem. Do you know what's setting the provider ID on your nodes?

We're running the cluster with kOps, so from the docs it seems the node controller inside kops-controller is the one applying this to the Kubernetes node: https://kops.sigs.k8s.io/architecture/kops-controller/. It also contains the instance-id, so it looks like aws:///eu-west-1a/. Anyway, what should it be to be valid?

doryer avatar Nov 07 '23 11:11 doryer

The log message you included looks like the provider wasn't defined at all (name is blank): https://github.com/kubernetes/cloud-provider-aws/blob/a1eb96d8ee3baffa8450e870c7360afa6ca836d2/pkg/providers/v1/instances.go#L82

I don't know how the provider ID is set in a kops cluster.
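For illustration, here is a minimal Go sketch of how a provider ID of the form aws:///<availability-zone>/<instance-id> might be parsed. It is not the upstream cloud-provider-aws code, and the instance ID used below is a made-up placeholder; it just shows why an unset provider ID surfaces as the empty "()" in the error message.

```go
package main

import (
	"fmt"
	"strings"
)

// parseInstanceID is a simplified sketch (not the upstream implementation):
// it takes the last path segment of a provider ID that is expected to look
// like aws:///<availability-zone>/<instance-id>.
func parseInstanceID(providerID string) (string, error) {
	s := strings.TrimPrefix(providerID, "aws://")
	parts := strings.Split(s, "/")
	name := parts[len(parts)-1]
	if !strings.HasPrefix(name, "i-") {
		// An unset or truncated provider ID leaves name empty, which shows up
		// as the "()" in "Invalid format for AWS instance ()".
		return "", fmt.Errorf("Invalid format for AWS instance (%s)", name)
	}
	return name, nil
}

func main() {
	fmt.Println(parseInstanceID("aws:///eu-west-1a/i-0123456789abcdef0")) // placeholder instance ID; parses fine
	fmt.Println(parseInstanceID(""))                                      // provider ID not set yet; reproduces the blank error
}
```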

/kind support

cartermckinnon avatar Nov 07 '23 18:11 cartermckinnon

The log message you included looks like the provider wasn't defined at all (name is blank):

https://github.com/kubernetes/cloud-provider-aws/blob/a1eb96d8ee3baffa8450e870c7360afa6ca836d2/pkg/providers/v1/instances.go#L82

I don't know how the provider ID is set in a kops cluster.

/kind support

OK, this helped me understand the issue. providerID is a field that kops-controller adds to every node that joins the cluster. For each instance that joined the cluster, we saw the error log a moment before it joined, so the node lifecycle controller is probably checking the node before the providerID has been added. After kops-controller adds the providerID to the node, the errors disappear. Maybe adding retries to the node lifecycle controller's providerID check, for environments managed by kOps, could solve the issue.
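To observe the race described above, a small client-go sketch like the following (assuming a kubeconfig at the default location) can print each node's spec.providerID; a node caught between joining the cluster and being patched by kops-controller will show an empty value:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the kubeconfig from its default location (~/.kube/config); adjust as needed.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List all nodes and print their provider IDs; an empty value means the
	// field has not been set yet (e.g. kops-controller has not patched the node).
	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		providerID := n.Spec.ProviderID
		if providerID == "" {
			providerID = "<not set yet>"
		}
		fmt.Printf("%s\t%s\n", n.Name, providerID)
	}
}
```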

doryer avatar Nov 08 '23 15:11 doryer

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 06 '24 15:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Mar 07 '24 16:03 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Apr 06 '24 16:04 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 06 '24 16:04 k8s-ci-robot

@cartermckinnon is it possible to /reopen this issue, please? We are seeing the same error messages in our clusters deployed with kubeadm, and it seems related to the explanation above (a race condition between the tool that sets the providerID on the node - kops or kubelet - and the node controller).

yogeek avatar Jul 10 '24 09:07 yogeek

@yogeek do you mind opening a fresh issue with the exact details from your clusters? (Then cross-link this issue from there.)

dims avatar Jul 10 '24 12:07 dims