
Invalid format for AWS instance

Open doryer opened this issue 2 years ago • 9 comments

What happened:

We upgraded to Kubernetes 1.25.15 and installed the AWS CCM in our cluster using kOps. Since the upgrade, we periodically see error logs like the following from the AWS CCM:

E1102 13:36:37.503965 1 node_lifecycle_controller.go:185] error checking if node <instance-id> is shutdown: Invalid format for AWS instance () }

We see this when a node comes up and joins the cluster. The node becomes Ready and it does not appear to affect scheduling, but we are still not sure why we see these errors.

What you expected to happen:

Not seeing errors for those nodes, or at least a more detailed error message.

How to reproduce it (as minimally and precisely as possible):

Run the cloud controller manager v1.25.12 on Kubernetes 1.25.15.

Environment:

  • Kubernetes version (use kubectl version): v1.25.15
  • Cloud provider or hardware configuration: AWS EC2
  • OS (e.g. from /etc/os-release): Ubuntu 20.04.6
  • Kernel (e.g. uname -a): 5.15.0-1037-aws
  • Install tools: kOps
  • Others:

/kind bug

doryer avatar Nov 02 '23 13:11 doryer

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Nov 02 '23 13:11 k8s-ci-robot

What does the node's provider ID look like?

cartermckinnon avatar Nov 02 '23 17:11 cartermckinnon

What does the node's provider ID look like?

aws:///eu-west-1a/

doryer avatar Nov 05 '23 09:11 doryer

That's definitely the problem. Do you know what's setting the provider ID on your nodes?

cartermckinnon avatar Nov 06 '23 19:11 cartermckinnon

That's definitely the problem. Do you know what's setting the provider ID on your nodes?

We're running the cluster with kOps, so from the docs it seems the node controller inside kops-controller is the one applying this to the Kubernetes node: https://kops.sigs.k8s.io/architecture/kops-controller/. It also contains the instance-id, so it looks like aws:///eu-west-1a/. Anyway, what should it be to be valid?

doryer avatar Nov 07 '23 11:11 doryer

The log message you included looks like the provider wasn't defined at all (name is blank): https://github.com/kubernetes/cloud-provider-aws/blob/a1eb96d8ee3baffa8450e870c7360afa6ca836d2/pkg/providers/v1/instances.go#L82

I don't know how the provider ID is set in a kops cluster.
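For illustration, here is a minimal Go sketch of how a provider ID of the form aws:///<availability-zone>/<instance-id> might be parsed. It is not the upstream cloud-provider-aws code, and the instance ID used below is a made-up placeholder; it just shows why an unset provider ID surfaces as the empty "()" in the error message.

```go
package main

import (
	"fmt"
	"strings"
)

// parseInstanceID is a simplified sketch (not the upstream implementation):
// it takes the last path segment of a provider ID that is expected to look
// like aws:///<availability-zone>/<instance-id>.
func parseInstanceID(providerID string) (string, error) {
	s := strings.TrimPrefix(providerID, "aws://")
	parts := strings.Split(s, "/")
	name := parts[len(parts)-1]
	if !strings.HasPrefix(name, "i-") {
		// An unset or truncated provider ID leaves name empty, which shows up
		// as the "()" in "Invalid format for AWS instance ()".
		return "", fmt.Errorf("Invalid format for AWS instance (%s)", name)
	}
	return name, nil
}

func main() {
	fmt.Println(parseInstanceID("aws:///eu-west-1a/i-0123456789abcdef0")) // placeholder instance ID; parses fine
	fmt.Println(parseInstanceID(""))                                      // provider ID not set yet; reproduces the blank error
}
```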

/kind support

cartermckinnon avatar Nov 07 '23 18:11 cartermckinnon

The log message you included looks like the provider wasn't defined at all (name is blank):

https://github.com/kubernetes/cloud-provider-aws/blob/a1eb96d8ee3baffa8450e870c7360afa6ca836d2/pkg/providers/v1/instances.go#L82

I don't know how the provider ID is set in a kops cluster.

/kind support

OK, this helped me understand the issue. providerID is a field that kops-controller adds to every node that joins the cluster. For each instance that joined the cluster, we saw the error log a moment before it joined, so the node lifecycle controller is probably checking the node before the providerID has been added. After kops-controller adds the providerID to the node, the errors disappear. Maybe adding retries to the node lifecycle controller's providerID check, for environments managed by kOps, could solve the issue.
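To observe the race described above, a small client-go sketch like the following (assuming a kubeconfig at the default location) can print each node's spec.providerID; a node caught between joining the cluster and being patched by kops-controller will show an empty value:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the kubeconfig from its default location (~/.kube/config); adjust as needed.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List all nodes and print their provider IDs; an empty value means the
	// field has not been set yet (e.g. kops-controller has not patched the node).
	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		providerID := n.Spec.ProviderID
		if providerID == "" {
			providerID = "<not set yet>"
		}
		fmt.Printf("%s\t%s\n", n.Name, providerID)
	}
}
```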

doryer avatar Nov 08 '23 15:11 doryer

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 06 '24 15:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Mar 07 '24 16:03 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Apr 06 '24 16:04 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 06 '24 16:04 k8s-ci-robot

@cartermckinnon is it possible to /reopen this issue, please? We are seeing the same error messages in our clusters deployed with kubeadm, and it seems related to the explanation above (a race condition between the tool that sets the providerID on the node - kops or kubelet - and the node controller).

yogeek avatar Jul 10 '24 09:07 yogeek

@yogeek do you mind opening a fresh issue with the exact details from your clusters? (Then cross-link this issue from there.)

dims avatar Jul 10 '24 12:07 dims