cluster-api-provider-aws
                        ELB deletions: timed out waiting for the condition
/kind bug
What steps did you take and what happened: First of all, thanks to everyone for your work. The project is very interesting, and it is clear a lot of effort has gone into it :). I am testing the creation of an AWS workload cluster from an existing EKS cluster, using existing VPC/subnet infrastructure.
The cluster is created correctly, but during deletion I hit a blocker at the load balancer step. It seems the CLB is deleted before the controller's CLB-deletion step actually runs. Here are the steps:
- Create a cluster
- Use kubectl delete cluster capi-quickstart -n default to delete it
- Everything is deleted correctly until it reaches the AWSCluster object, where the CLB is deleted but the controller then starts reporting a timeout:
E0601 08:51:01.201142       1 awscluster_controller.go:167] controllers/AWSCluster "msg"="error deleting load balancer" "error"="failed to wait for \"capi-quickstart\" ELB deletions: timed out waiting for the condition" "awsCluster"="capi-quickstart" "cluster"="capi-quickstart" "namespace"="default" 
This then blocks cluster deletion.
Is there a way I can force this deletion at the API level, since the CLB is already deleted but the cluster seems to be stuck in a loop still trying to delete it?
What did you expect to happen:
Cluster deletion order to be preserved and not reach timeout errors.
Anything else you would like to add: Here is my cluster config:
apiVersion: cluster.x-k8s.io/v1alpha3
kind: Cluster
metadata:
  name: capi-quickstart
  namespace: default
  labels:
    cluster.x-k8s.io/cluster-name: capi-quickstart
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
      - 192.168.0.0/16
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
    kind: KubeadmControlPlane
    name: capi-quickstart-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: AWSCluster
    name: capi-quickstart
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: AWSCluster
metadata:
  name: capi-quickstart
  namespace: default
spec:
  networkSpec:
    vpc:
      id: [...]
    subnets:
      [...]
  bastion:
    enabled: true
  additionalTags:
    [...]
  region: us-west-2
  sshKeyName: [...]
  controlPlaneLoadBalancer:
    scheme: internal
---
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: KubeadmControlPlane
metadata:
  name: capi-quickstart-control-plane
  namespace: default
spec:
  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: AWSMachineTemplate
    name: capi-quickstart-control-plane
  kubeadmConfigSpec:
    clusterConfiguration:
      apiServer:
        extraArgs:
          cloud-provider: aws
      controllerManager:
        extraArgs:
          cloud-provider: aws
    initConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          cloud-provider: aws
        name: '{{ ds.meta_data.local_hostname }}'
    joinConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          cloud-provider: aws
        name: '{{ ds.meta_data.local_hostname }}'
  replicas: 3
  version: v1.19.8
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: AWSMachineTemplate
metadata:
  name: capi-quickstart-control-plane
  namespace: default
spec:
  template:
    spec:
      iamInstanceProfile: control-plane.cluster-api-provider-aws.sigs.k8s.io
      additionalTags:
         [...]
      instanceType: t3.small
      sshKeyName: [...]
---
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineDeployment
metadata:
  name: capi-quickstart-md-0
  namespace: default
spec:
  clusterName: capi-quickstart
  replicas: 3
  selector:
    matchLabels: null
  template:
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
          kind: KubeadmConfigTemplate
          name: capi-quickstart-md-0
      clusterName: capi-quickstart
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
        kind: AWSMachineTemplate
        name: capi-quickstart-md-0
      version: v1.19.8
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: AWSMachineTemplate
metadata:
  name: capi-quickstart-md-0
  namespace: default
spec:
  template:
    spec:
      iamInstanceProfile: nodes.cluster-api-provider-aws.sigs.k8s.io
      additionalTags:
         [...]
      instanceType: t3.small
      sshKeyName: [...]
---
apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
kind: KubeadmConfigTemplate
metadata:
  name: capi-quickstart-md-0
  namespace: default
spec:
  template:
    spec:
      joinConfiguration:
        nodeRegistration:
          kubeletExtraArgs:
            cloud-provider: aws
          name: '{{ ds.meta_data.local_hostname }}'
---
apiVersion: addons.cluster.x-k8s.io/v1alpha3
kind: ClusterResourceSet
metadata:
  name: capi-quickstart-1-crs-0
  namespace: default
spec:
  clusterSelector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: capi-quickstart
  resources:
  - kind: ConfigMap
    name: calico-cni
  - kind: ConfigMap
    name: nginx-ingress
Environment:
- Cluster-api-provider-aws version:
Installing cert-manager Version="v1.1.0"
Waiting for cert-manager to be available...
Installing Provider="cluster-api" Version="v0.3.17" TargetNamespace="capi-system"
Installing Provider="bootstrap-kubeadm" Version="v0.3.17" TargetNamespace="capi-kubeadm-bootstrap-system"
Installing Provider="control-plane-kubeadm" Version="v0.3.17" TargetNamespace="capi-kubeadm-control-plane-system"
Installing Provider="infrastructure-aws" Version="v0.6.6" TargetNamespace="capa-system"
- Kubernetes version: (use kubectl version): client 1.18.8 / server 1.19.8
- OS (e.g. from /etc/os-release): debian 10
Is the CLB deleted by the controllers? Can you check the capa-manager logs and search for: deleting load balancer. If that log line includes an ARN at some point, it means the controller deleted the CLB.
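Something like the following should surface that line (a sketch; the capa-system namespace and capa-controller-manager deployment name assume a default CAPA installation):

```shell
# Search the CAPA controller manager logs for the CLB deletion entry.
# Namespace and deployment name assume a default install; adjust if
# CAPA was deployed elsewhere.
kubectl logs -n capa-system deploy/capa-controller-manager \
  | grep "deleting load balancer"
```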
Is there a way I can force this deletion at the API level, since the CLB is already deleted but the cluster seems to be stuck in a loop still trying to delete it?
You can remove the awscluster.infrastructure.cluster.x-k8s.io finalizer from the AWSCluster resource to delete it without waiting for the cleanup. But if this is not a user error, then we should fix the underlying issue rather than rely on this workaround.
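For reference, removing the finalizer can be done with a patch along these lines (a sketch; the resource name and namespace match the config above):

```shell
# Clear the finalizers on the stuck AWSCluster so deletion can complete
# without waiting for cloud cleanup. Note this removes ALL finalizers,
# not just the CAPA one; any AWS resources the controller has not yet
# cleaned up will be orphaned and must be deleted manually.
kubectl patch awscluster capi-quickstart -n default \
  --type=merge -p '{"metadata":{"finalizers":null}}'
```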
Hey @sedefsavas, sorry for the late reply. In the capa-controller-manager pod logs I see this:
I0823 19:41:59.552941       1 awsmachine_controller.go:373]  "msg"="Terminating EC2 instance"  "instance-id"="i-0cc919bbcd649116a"
I0823 19:41:59.578752       1 awsmachine_controller.go:373]  "msg"="Terminating EC2 instance"  "instance-id"="i-0fbee9ca995b4225b"
I0823 19:41:59.607431       1 awsmachine_controller.go:373]  "msg"="Terminating EC2 instance"  "instance-id"="i-09f43c2b82ebb2a35"
I0823 19:44:00.356912       1 awsmachine_controller.go:419]  "msg"="EC2 instance successfully terminated"  "instance-id"="i-0cc919bbcd649116a"
E0823 19:44:00.476040       1 controller.go:304] controller-runtime/manager/controller/awsmachine "msg"="Reconciler error" "error"="awsmachines.infrastructure.cluster.x-k8s.io \"capi-quickstart-control-plane-s8t5t\" not found" "name"="capi-quickstart-control-plane-s8t5t" "namespace"="default" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AWSMachine" 
I0823 19:44:15.571083       1 awsmachine_controller.go:419]  "msg"="EC2 instance successfully terminated"  "instance-id"="i-09f43c2b82ebb2a35"
E0823 19:44:15.652809       1 controller.go:304] controller-runtime/manager/controller/awsmachine "msg"="Reconciler error" "error"="awsmachines.infrastructure.cluster.x-k8s.io \"capi-quickstart-control-plane-m29kf\" not found" "name"="capi-quickstart-control-plane-m29kf" "namespace"="default" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AWSMachine" 
I0823 19:44:30.549036       1 awsmachine_controller.go:419]  "msg"="EC2 instance successfully terminated"  "instance-id"="i-0fbee9ca995b4225b"
E0823 19:44:30.631608       1 controller.go:304] controller-runtime/manager/controller/awsmachine "msg"="Reconciler error" "error"="awsmachines.infrastructure.cluster.x-k8s.io \"capi-quickstart-control-plane-r4vqs\" not found" "name"="capi-quickstart-control-plane-r4vqs" "namespace"="default" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AWSMachine" 
I0823 19:44:31.472465       1 awscluster_controller.go:149] controller-runtime/manager/controller/awscluster "msg"="Reconciling AWSCluster delete" "cluster"="capi-quickstart" "name"="capi-quickstart" "namespace"="default" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AWSCluster" 
E0823 19:48:00.734703       1 awscluster_controller.go:165] controller-runtime/manager/controller/awscluster "msg"="error deleting load balancer" "error"="failed to wait for \"capi-quickstart\" ELB deletions: timed out waiting for the condition" "cluster"="capi-quickstart" "name"="capi-quickstart" "namespace"="default" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AWSCluster" 
E0823 19:48:00.899671       1 controller.go:304] controller-runtime/manager/controller/awscluster "msg"="Reconciler error" "error"="failed to wait for \"capi-quickstart\" ELB deletions: timed out waiting for the condition" "name"="capi-quickstart" "namespace"="default" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AWSCluster" 
I0823 19:48:00.900810       1 awscluster_controller.go:149] controller-runtime/manager/controller/awscluster "msg"="Reconciling AWSCluster delete" "cluster"="capi-quickstart" "name"="capi-quickstart" "namespace"="default" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AWSCluster" 
Basically, there is no log entry between the termination of the control-plane nodes and the AWSCluster deletion that shows the ELB being deleted; it is removed without any notice...
Indeed, I was able to delete the cluster after removing the finalizer, but the issue is still present after upgrading to the latest clusterctl :/
If you need any further information, let me know. Thanks for your reply!
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
For reference, the "failed to wait for \"capi-quickstart\" ELB deletions: timed out waiting for the condition" error originates here: https://github.com/kubernetes-sigs/cluster-api-provider-aws//blob/8a81ce6890e5728d4c23f95363643b10ab89efb6/pkg/cloud/services/elb/loadbalancer.go#L142-L157
Something to note is that this code lists the ELBs for the cluster's LoadBalancer-type Services as well as the API server's ELB. It is not clear from the error which ELB deletion the code is waiting on.
In v1.0, the code paths are separate:
https://github.com/kubernetes-sigs/cluster-api-provider-aws//blob/8803f1257dbd2c7a6bd1261ee39b185a965b2235/pkg/cloud/services/elb/loadbalancer.go#L179-L185
and
https://github.com/kubernetes-sigs/cluster-api-provider-aws//blob/8803f1257dbd2c7a6bd1261ee39b185a965b2235/pkg/cloud/services/elb/loadbalancer.go#L211-L220
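Since the wait covers both kinds of ELB, one way to see which load balancer the controller might still be waiting on is to list what remains on the AWS side. A sketch, assuming the AWS CLI is configured for the right account; the capi-quickstart-apiserver name is an assumption based on CAPA's default <cluster-name>-apiserver naming, so verify the actual name in your account:

```shell
# List the names of all classic ELBs in the cluster's region and filter
# for ones belonging to the cluster. If nothing matches, the CLBs are
# gone on the AWS side and the controller is waiting on a stale view.
aws elb describe-load-balancers --region us-west-2 \
  --query 'LoadBalancerDescriptions[].LoadBalancerName' --output text \
  | tr '\t' '\n' | grep capi-quickstart || echo "no matching CLB found"
```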
/triage accepted
/priority backlog
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/lifecycle frozen
/remove-lifecycle frozen
/lifecycle stale
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.