aws-cloud-provider (version 1.27.1) always crashes

datavisorhenryzhao opened this issue 2 years ago • 5 comments

What happened: Kubernetes cluster 1.27.6.

Master nodes: kubelet configured via the kubeadm_config.yaml below, then run kubeadm join

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: "200Mi"
containerLogMaxFiles: 3
imageGCHighThresholdPercent: 80
imageGCLowThresholdPercent: 75
imageMinimumGCAge: "5m30s"
providerID: "aws"
evictionHard:
    memory.available:  "200Mi"
    imagefs.available: "15%"
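
For reference, the AWS provider expects providerID in the form aws:///<availability-zone>/<instance-id>; a value it can parse would look like the line below (the zone and instance ID are illustrative placeholders, not values from this cluster):

providerID: "aws:///us-east-1a/i-0123456789abcdef0"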

worker node: run kubeadm join

cluster info

#kubectl get node 
NAME                            STATUS   ROLES           AGE   VERSION
ip-10-142-23-229.ec2.internal   Ready    <none>          36m   v1.27.6
ip-10-142-39-245.ec2.internal   Ready    control-plane   30h   v1.27.6
ip-10-142-42-164.ec2.internal   Ready    control-plane   30h   v1.27.6
ip-10-142-61-198.ec2.internal   Ready    control-plane   30h   v1.27.6

#kubectl get node ip-10-142-23-229.ec2.internal -o yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
   ...
spec:
  podCIDR: 192.168.8.0/24
  podCIDRs:
  - 192.168.8.0/24
  providerID: aws
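
For reference, providerIDs across all nodes can be listed with a custom-columns query; any node showing a bare "aws" (or an empty value) was registered without a usable provider ID:

#kubectl get nodes -o custom-columns=NAME:.metadata.name,PROVIDERID:.spec.providerID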

AWS cloud controller manager crash log:

I1115 07:06:42.124716       1 aws.go:861] Setting up informers for Cloud
W1115 07:06:42.124764       1 controllermanager.go:313] "tagging" is disabled
I1115 07:06:42.124773       1 controllermanager.go:317] Starting "cloud-node"
I1115 07:06:42.128849       1 controllermanager.go:336] Started "cloud-node"
I1115 07:06:42.131324       1 controllermanager.go:317] Starting "cloud-node-lifecycle"
I1115 07:06:42.128909       1 node_controller.go:161] Sending events to api server.
I1115 07:06:42.131591       1 node_controller.go:170] Waiting for informer caches to sync
I1115 07:06:42.131945       1 controllermanager.go:336] Started "cloud-node-lifecycle"
I1115 07:06:42.131964       1 controllermanager.go:317] Starting "service"
I1115 07:06:42.132052       1 node_lifecycle_controller.go:113] Sending events to api server
I1115 07:06:42.133178       1 controllermanager.go:336] Started "service"
I1115 07:06:42.133400       1 controllermanager.go:317] Starting "route"
I1115 07:06:42.133409       1 core.go:104] Will not configure cloud provider routes, --configure-cloud-routes: false
W1115 07:06:42.133418       1 controllermanager.go:324] Skipping "route"
I1115 07:06:42.133728       1 controller.go:229] Starting service controller
I1115 07:06:42.133802       1 shared_informer.go:311] Waiting for caches to sync for service
E1115 07:06:42.142644       1 runtime.go:79] Observed a panic: &errors.errorString{s:"unable to calculate an index entry for key \"ip-10-142-23-229.ec2.internal\" on index \"instanceID\": error mapping node \"ip-10-142-23-229.ec2.internal\"'s provider ID \"aws\" to instance ID: Invalid format for AWS instance (aws)"} (unable to calculate an index entry for key "ip-10-142-23-229.ec2.internal" on index "instanceID": error mapping node "ip-10-142-23-229.ec2.internal"'s provider ID "aws" to instance ID: Invalid format for AWS instance (aws))

What you expected to happen: aws-cloud-provider should not crash.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.27.6
  • Cloud provider or hardware configuration: aws
  • OS (e.g. from /etc/os-release): Ubuntu 22.04.3 LTS
  • Kernel (e.g. uname -a): Linux ip-10-142-1-183 6.2.0-1015-aws #15~22.04.1-Ubuntu SMP Fri Oct
  • Install tools:
  • Others:

/kind bug

datavisorhenryzhao (Nov 15 '23 09:11)

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot (Nov 15 '23 09:11)

The node.spec.providerID is "aws", but aws-cloud-provider expects something like 'providerID: aws:///us-east-1a/i-xxxx'.

datavisorhenryzhao (Nov 15 '23 09:11)

I found that when I start kubelet with "--cloud-provider=external" on the master and worker nodes, node.spec.providerID looks like "aws:///region/instance-id", and the AWS cloud controller does not crash.
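
For illustration, a minimal kubeadm JoinConfiguration sketch that passes this flag to the kubelet via kubeletExtraArgs (assuming the v1beta3 config API; token/discovery settings omitted):

apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration
nodeRegistration:
  kubeletExtraArgs:
    # kubelet skips cloud-specific initialization; the external cloud
    # controller manager then sets node.spec.providerID from EC2 metadata
    cloud-provider: external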

datavisorhenryzhao (Nov 16 '23 08:11)

@datavisorhenryzhao Are you still seeing this issue?

mmerkes (Jan 02 '24 17:01)

The crash has been fixed in https://github.com/kubernetes/cloud-provider-aws/pull/605. I will work on backporting the fix to older versions.

kmala (Jan 16 '24 02:01)

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot (Apr 15 '24 02:04)

This is resolved across all the active release branches.

cartermckinnon (Apr 15 '24 16:04)