
ELB deletions: timed out waiting for the condition

Open haho16 opened this issue 4 years ago • 8 comments

/kind bug

What steps did you take and what happened:

First of all, thanks to everyone for your work. The project is very interesting, and it is clear there is a lot of effort involved in it :). I am testing the creation of an AWS workload cluster from an existing EKS cluster, using existing VPC/subnet infrastructure.

The cluster is created correctly; however, during deletion I hit a blocker when the load balancer is being deleted. It seems the CLB is already gone before the controller's own CLB-deletion step runs. Here are the steps:

  • Create a cluster
  • Use kubectl delete cluster capi-quickstart -n default to delete it
  • Everything is deleted correctly until it reaches the AWSCluster, where the CLB is deleted but the controller then starts complaining that it has hit a timeout:
E0601 08:51:01.201142       1 awscluster_controller.go:167] controllers/AWSCluster "msg"="error deleting load balancer" "error"="failed to wait for \"capi-quickstart\" ELB deletions: timed out waiting for the condition" "awsCluster"="capi-quickstart" "cluster"="capi-quickstart" "namespace"="default" 

This then blocks cluster deletion.

Is there a way I can force this deletion at the API level? The CLB is correctly deleted, but the cluster seems to enter a loop, still trying to delete it.
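(For reference, whether the CLB still exists on the AWS side can be checked directly against the ELB API; a sketch, assuming the control-plane CLB follows the usual <cluster-name>-apiserver naming:)

aws elb describe-load-balancers --load-balancer-names capi-quickstart-apiserver --region us-west-2
# a LoadBalancerNotFound error here means the CLB is already gone on the AWS side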

What did you expect to happen:

The cluster deletion order to be preserved, without hitting timeout errors.

Anything else you would like to add:

Here is my cluster config:

apiVersion: cluster.x-k8s.io/v1alpha3
kind: Cluster
metadata:
  name: capi-quickstart
  namespace: default
  labels:
    cluster.x-k8s.io/cluster-name: capi-quickstart
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
      - 192.168.0.0/16
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
    kind: KubeadmControlPlane
    name: capi-quickstart-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: AWSCluster
    name: capi-quickstart
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: AWSCluster
metadata:
  name: capi-quickstart
  namespace: default
spec:
  networkSpec:
    vpc:
      id: [...]
    subnets:
      [...]
  bastion:
    enabled: true
  additionalTags:
    [...]
  region: us-west-2
  sshKeyName: [...]
  controlPlaneLoadBalancer:
    scheme: internal
---
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: KubeadmControlPlane
metadata:
  name: capi-quickstart-control-plane
  namespace: default
spec:
  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: AWSMachineTemplate
    name: capi-quickstart-control-plane
  kubeadmConfigSpec:
    clusterConfiguration:
      apiServer:
        extraArgs:
          cloud-provider: aws
      controllerManager:
        extraArgs:
          cloud-provider: aws
    initConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          cloud-provider: aws
        name: '{{ ds.meta_data.local_hostname }}'
    joinConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          cloud-provider: aws
        name: '{{ ds.meta_data.local_hostname }}'
  replicas: 3
  version: v1.19.8
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: AWSMachineTemplate
metadata:
  name: capi-quickstart-control-plane
  namespace: default
spec:
  template:
    spec:
      iamInstanceProfile: control-plane.cluster-api-provider-aws.sigs.k8s.io
      additionalTags:
         [...]
      instanceType: t3.small
      sshKeyName: [...]
---
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineDeployment
metadata:
  name: capi-quickstart-md-0
  namespace: default
spec:
  clusterName: capi-quickstart
  replicas: 3
  selector:
    matchLabels: null
  template:
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
          kind: KubeadmConfigTemplate
          name: capi-quickstart-md-0
      clusterName: capi-quickstart
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
        kind: AWSMachineTemplate
        name: capi-quickstart-md-0
      version: v1.19.8
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: AWSMachineTemplate
metadata:
  name: capi-quickstart-md-0
  namespace: default
spec:
  template:
    spec:
      iamInstanceProfile: nodes.cluster-api-provider-aws.sigs.k8s.io
      additionalTags:
         [...]
      instanceType: t3.small
      sshKeyName: [...]
---
apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
kind: KubeadmConfigTemplate
metadata:
  name: capi-quickstart-md-0
  namespace: default
spec:
  template:
    spec:
      joinConfiguration:
        nodeRegistration:
          kubeletExtraArgs:
            cloud-provider: aws
          name: '{{ ds.meta_data.local_hostname }}'
---
apiVersion: addons.cluster.x-k8s.io/v1alpha3
kind: ClusterResourceSet
metadata:
  name: capi-quickstart-1-crs-0
  namespace: default
spec:
  clusterSelector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: capi-quickstart
  resources:
  - kind: ConfigMap
    name: calico-cni
  - kind: ConfigMap
    name: nginx-ingress

Environment:

  • Cluster-api-provider-aws version:
Installing cert-manager Version="v1.1.0"
Waiting for cert-manager to be available...
Installing Provider="cluster-api" Version="v0.3.17" TargetNamespace="capi-system"
Installing Provider="bootstrap-kubeadm" Version="v0.3.17" TargetNamespace="capi-kubeadm-bootstrap-system"
Installing Provider="control-plane-kubeadm" Version="v0.3.17" TargetNamespace="capi-kubeadm-control-plane-system"
Installing Provider="infrastructure-aws" Version="v0.6.6" TargetNamespace="capa-system"
  • Kubernetes version: (use kubectl version): client 1.18.8 / server 1.19.8
  • OS (e.g. from /etc/os-release): debian 10

haho16 commented Jun 01 '21 09:06

Is the CLB deleted by the controllers? Can you check the capa-manager logs and search for deleting load balancer? If there is an ARN in that log line at some point, it means the controller deleted the CLB.
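For example, something like this should surface it (a sketch; capa-system and capa-controller-manager are the default namespace and deployment names installed by clusterctl init, and the container is assumed to be named manager):

kubectl logs -n capa-system deployment/capa-controller-manager -c manager | grep -i "deleting load balancer"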

Is there a way I can force this deletion at the API level? The CLB is correctly deleted, but the cluster seems to enter a loop, still trying to delete it.

You can remove the awscluster.infrastructure.cluster.x-k8s.io finalizer from the AWSCluster resource to delete it without waiting for the cleanup. But if this is not a user error, then we should fix the underlying issue rather than rely on this workaround.
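For completeness, clearing the finalizer would look roughly like this (a sketch; it removes all finalizers on the object, so only use it once you are sure the AWS resources are actually gone):

# clear the finalizers on the AWSCluster so the API server can delete it
kubectl patch awscluster capi-quickstart -n default --type merge -p '{"metadata":{"finalizers":null}}'
# alternatively: kubectl edit awscluster capi-quickstart -n default and delete the
# awscluster.infrastructure.cluster.x-k8s.io entry from metadata.finalizers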

sedefsavas commented Jun 03 '21 03:06

Hey @sedefsavas, sorry for the late reply. In the capa-controller-manager pod logs I see this:

I0823 19:41:59.552941       1 awsmachine_controller.go:373]  "msg"="Terminating EC2 instance"  "instance-id"="i-0cc919bbcd649116a"
I0823 19:41:59.578752       1 awsmachine_controller.go:373]  "msg"="Terminating EC2 instance"  "instance-id"="i-0fbee9ca995b4225b"
I0823 19:41:59.607431       1 awsmachine_controller.go:373]  "msg"="Terminating EC2 instance"  "instance-id"="i-09f43c2b82ebb2a35"
I0823 19:44:00.356912       1 awsmachine_controller.go:419]  "msg"="EC2 instance successfully terminated"  "instance-id"="i-0cc919bbcd649116a"
E0823 19:44:00.476040       1 controller.go:304] controller-runtime/manager/controller/awsmachine "msg"="Reconciler error" "error"="awsmachines.infrastructure.cluster.x-k8s.io \"capi-quickstart-control-plane-s8t5t\" not found" "name"="capi-quickstart-control-plane-s8t5t" "namespace"="default" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AWSMachine" 
I0823 19:44:15.571083       1 awsmachine_controller.go:419]  "msg"="EC2 instance successfully terminated"  "instance-id"="i-09f43c2b82ebb2a35"
E0823 19:44:15.652809       1 controller.go:304] controller-runtime/manager/controller/awsmachine "msg"="Reconciler error" "error"="awsmachines.infrastructure.cluster.x-k8s.io \"capi-quickstart-control-plane-m29kf\" not found" "name"="capi-quickstart-control-plane-m29kf" "namespace"="default" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AWSMachine" 
I0823 19:44:30.549036       1 awsmachine_controller.go:419]  "msg"="EC2 instance successfully terminated"  "instance-id"="i-0fbee9ca995b4225b"
E0823 19:44:30.631608       1 controller.go:304] controller-runtime/manager/controller/awsmachine "msg"="Reconciler error" "error"="awsmachines.infrastructure.cluster.x-k8s.io \"capi-quickstart-control-plane-r4vqs\" not found" "name"="capi-quickstart-control-plane-r4vqs" "namespace"="default" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AWSMachine" 

I0823 19:44:31.472465       1 awscluster_controller.go:149] controller-runtime/manager/controller/awscluster "msg"="Reconciling AWSCluster delete" "cluster"="capi-quickstart" "name"="capi-quickstart" "namespace"="default" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AWSCluster" 
E0823 19:48:00.734703       1 awscluster_controller.go:165] controller-runtime/manager/controller/awscluster "msg"="error deleting load balancer" "error"="failed to wait for \"capi-quickstart\" ELB deletions: timed out waiting for the condition" "cluster"="capi-quickstart" "name"="capi-quickstart" "namespace"="default" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AWSCluster" 
E0823 19:48:00.899671       1 controller.go:304] controller-runtime/manager/controller/awscluster "msg"="Reconciler error" "error"="failed to wait for \"capi-quickstart\" ELB deletions: timed out waiting for the condition" "name"="capi-quickstart" "namespace"="default" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AWSCluster" 
I0823 19:48:00.900810       1 awscluster_controller.go:149] controller-runtime/manager/controller/awscluster "msg"="Reconciling AWSCluster delete" "cluster"="capi-quickstart" "name"="capi-quickstart" "namespace"="default" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AWSCluster" 

Basically, there is no entry between the deletion of the control-plane nodes and the AWSCluster delete reconciliation that shows the ELB being deleted; it disappears without any notice...

Indeed, I was able to delete the cluster after removing the finalizer, but the issue is still present after upgrading to the latest clusterctl :/

If you need any further information, let me know. Thanks for your reply!

haho16 commented Aug 23 '21 22:08

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented Nov 21 '21 23:11

For reference, the "failed to wait for \"capi-quickstart\" ELB deletions: timed out waiting for the condition" error originates here: https://github.com/kubernetes-sigs/cluster-api-provider-aws//blob/8a81ce6890e5728d4c23f95363643b10ab89efb6/pkg/cloud/services/elb/loadbalancer.go#L142-L157

Something to note is that this code lists the ELBs for the cluster's LoadBalancer-type Services as well as the API server ELB. It is not clear from the error which ELB deletion the code is waiting on.
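To narrow down which ELB the wait is stuck on, one option is to list the remaining classic ELBs in the region and inspect their tags (a sketch; the kubernetes.io/cluster/<cluster-name> and sigs.k8s.io/cluster-api-provider-aws/cluster/<cluster-name> ownership tags are assumptions about how Service-created and CAPA-managed ELBs are tagged):

# list all classic ELBs in the region
aws elb describe-load-balancers --region us-west-2 --query 'LoadBalancerDescriptions[].LoadBalancerName' --output text
# then check the tags of each one to see whether it belongs to the cluster
aws elb describe-tags --load-balancer-names <elb-name> --region us-west-2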

In v1.0, the code paths are separate:

https://github.com/kubernetes-sigs/cluster-api-provider-aws//blob/8803f1257dbd2c7a6bd1261ee39b185a965b2235/pkg/cloud/services/elb/loadbalancer.go#L179-L185

and

https://github.com/kubernetes-sigs/cluster-api-provider-aws//blob/8803f1257dbd2c7a6bd1261ee39b185a965b2235/pkg/cloud/services/elb/loadbalancer.go#L211-L220

dlipovetsky commented Nov 29 '21 18:11

/triage accepted
/priority backlog

richardcase commented Nov 29 '21 19:11

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented Dec 29 '21 19:12

/lifecycle frozen

richardcase commented Jan 10 '22 08:01

/remove-lifecycle frozen

richardcase commented Jul 12 '22 16:07

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented Oct 23 '22 21:10

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented Nov 22 '22 21:11

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot commented Dec 22 '22 22:12

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot commented Dec 22 '22 22:12