Avoid detaching ENIs on nodes being drained

Open · mogren opened this issue 4 years ago · 19 comments

What would you like to be added: We should prevent ipamd from trying to free ENIs when a node is about to be terminated.

For spot instances, we could do something similar to the aws-node-termination-handler and check the spot interruption notice in the instance metadata.
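
For illustration, a check along the lines of what aws-node-termination-handler does could poll the spot instance-action metadata. A minimal curl sketch (IMDSv2 token flow shown; the endpoint returns 404 until an interruption is scheduled):

# Minimal sketch: poll IMDS for a pending spot interruption.
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
# 404 means no interruption is scheduled; 200 returns a JSON action and time.
curl -s -o /dev/null -w "%{http_code}\n" \
  -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/spot/instance-action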

For the case where a node is cordoned before being terminated, meaning it is marked as "unschedulable", we should be able to check the node's taints before trying to attach or detach any ENIs.
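
As a rough sketch of the signal ipamd could look at: cordoning a node sets .spec.unschedulable and the node.kubernetes.io/unschedulable taint, both visible through the Kubernetes API (hypothetical node name below):

# Hypothetical node name; either field could gate ENI attach/detach decisions.
kubectl get node ip-10-0-1-23.ec2.internal -o jsonpath='{.spec.unschedulable}'
kubectl get node ip-10-0-1-23.ec2.internal -o jsonpath='{.spec.taints}'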

Why is this needed: There is no EC2 API call to directly "delete" an ENI that is attached; it first has to be detached, which takes a few seconds, and then deleted. If the instance gets terminated after the ENI has been detached but before it has been deleted, the ENI is leaked. Leaked ENIs can prevent Security Groups and VPCs from being deleted and require manual clean-up.
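
The two-step sequence looks roughly like this (hypothetical IDs); the leak happens when the instance terminates between the two calls, because a detached ENI is no longer covered by DeleteOnTermination:

# Hypothetical IDs: an attached ENI has to be detached before it can be deleted.
aws ec2 detach-network-interface --attachment-id eni-attach-0123456789abcdef0
# The ENI takes a few seconds to reach the "available" state...
aws ec2 delete-network-interface --network-interface-id eni-0123456789abcdef0
# If the instance terminates before the delete call succeeds, the ENI is leaked.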

Related issues: #608, #69, #690

mogren avatar Sep 18 '20 17:09 mogren

There are two parts to this:

1. There is an internal tracking ticket with the EC2 team to see if this can be handled on their side.
2. ipamd can check whether the node is marked as unschedulable and, if so, skip deleting the ENI.

The cleaner approach is the first.

jayanthvn avatar Jan 21 '21 21:01 jayanthvn

A POC of the ipamd changes is pending. We are also following up with the EC2 team on whether this can be handled internally.

jayanthvn avatar Jan 27 '21 20:01 jayanthvn

Do we have an ETA on this fix, please? This is causing us a lot of issues and requires substantial code to work around.

hurriyetgiray-ping avatar Jun 22 '21 10:06 hurriyetgiray-ping

Can this be tagged as a bug please, as this is more than a feature request?

hurriyetgiray-ping avatar Jun 23 '21 09:06 hurriyetgiray-ping

Hi @hurriyetgiray-ping,

What CNI version are you using, and do you have short-lived clusters in the account? Currently we have a background thread (https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/awsutils/awsutils.go#L382) which cleans up leaked ENIs in the account, though I agree this requires at least one aws-node pod running in the account.
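
If you need to sweep manually in the meantime, leaked ENIs usually show up as "available" interfaces that still carry the CNI's tags. A sketch (the tag key below is an assumption based on recent CNI versions; verify it against the tags on your ENIs):

# Sketch: list detached ENIs that still carry the CNI's instance tag.
# The tag key is an assumption and may differ between CNI versions.
aws ec2 describe-network-interfaces \
  --filters Name=status,Values=available \
            Name=tag-key,Values=node.k8s.amazonaws.com/instance_id \
  --query 'NetworkInterfaces[].NetworkInterfaceId' --output text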

jayanthvn avatar Jun 28 '21 20:06 jayanthvn

Hello @jayanthvn. Thank you for your response and also for tagging this issue as a bug. The CNI version is v1.7.5, as per the kubectl describe output below.

bash-4.2$ kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2
amazon-k8s-cni-init:v1.7.5
amazon-k8s-cni:v1.7.5 

Would the background thread you mention help us? How frequently does it run?

Not sure what qualifies a cluster's life span as 'short', but this particular use case could possibly qualify. We have an EKS cluster, but we manage the nodes ourselves. As part of our end-to-end test suite, we create nodes, perform tests, and then delete them via CloudFormation. Here are the related CloudFormation stack events and timeline showing the delete error.


2021-06-23 16:51:27 UTC+0100 eks-stateful-node-xxx DELETE_FAILED The following resource(s) failed to delete: [WorkerNodeSecurityGroup].
2021-06-23 16:51:26 UTC+0100 WorkerNodeSecurityGroup DELETE_FAILED resource sg-xxx has a dependent object (Service: AmazonEC2; Status Code: 400; Error Code: DependencyViolation; Request ID: xxx; Proxy: null)

2021-06-23 16:07:10 UTC+0100 WorkerNodeSecurityGroup CREATE_COMPLETE  (creates sg-xxx)

2021-06-23 16:06:59 UTC+0100 eks-stateful-node-xxx CREATE_IN_PROGRESS	Transformation succeeded
2021-06-23 16:06:52 UTC+0100 eks-stateful-node-xxx CREATE_IN_PROGRESS
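
When a security group hits that DependencyViolation, the dependent objects are typically leaked ENIs that still reference it. One way to find them (hypothetical security group ID):

# Hypothetical security group ID; lists ENIs still referencing the group.
aws ec2 describe-network-interfaces \
  --filters Name=group-id,Values=sg-0123456789abcdef0 \
  --query 'NetworkInterfaces[].[NetworkInterfaceId,Status]' --output text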

hurriyetgiray-ping avatar Jun 29 '21 11:06 hurriyetgiray-ping

kOps is also having problems with ENIs being leaked by the amazon-vpc CNI upon cluster deletion.

I suspect the window between creation of an ENI and attaching it with DeleteOnTermination set is also a factor.
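
For context, DeleteOnTermination is an attribute of the attachment, so it can only be set after the ENI is attached; the window looks roughly like this (hypothetical IDs):

# Hypothetical IDs. The ENI is unprotected until the last call completes.
aws ec2 create-network-interface --subnet-id subnet-0123456789abcdef0 --groups sg-0123456789abcdef0
aws ec2 attach-network-interface --network-interface-id eni-0123456789abcdef0 \
  --instance-id i-0123456789abcdef0 --device-index 1
aws ec2 modify-network-interface-attribute --network-interface-id eni-0123456789abcdef0 \
  --attachment AttachmentId=eni-attach-0123456789abcdef0,DeleteOnTermination=true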

johngmyers avatar Dec 31 '21 20:12 johngmyers

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions[bot] avatar Apr 13 '22 00:04 github-actions[bot]

People have been complaining about this since 2019. Can we get this taken care of now please?

Nuru avatar Apr 13 '22 09:04 Nuru

@Nuru - #1927 should mitigate this issue to some extent, but we are actively working with the EC2 team on the implementation of the detach and delete calls.

jayanthvn avatar Apr 13 '22 14:04 jayanthvn

@jayanthvn anything preventing #1927 from being merged?

bryantbiggs avatar Apr 13 '22 14:04 bryantbiggs

@bryantbiggs - It is pending code review. We are tracking it for 1.11.1 release. I will provide an ETA soon.

jayanthvn avatar Apr 13 '22 15:04 jayanthvn

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions[bot] avatar Jun 13 '22 00:06 github-actions[bot]

/not stale

jayanthvn avatar Jun 13 '22 14:06 jayanthvn

Hi, any update here? Thanks

EladGabay avatar Jul 21 '22 09:07 EladGabay

https://github.com/aws/amazon-vpc-cni-k8s/pull/1927 mitigates the issue to a certain extent, but we are actively working with the EC2/VPC team on fixing the backend calls.

jayanthvn avatar Jul 21 '22 13:07 jayanthvn

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions[bot] avatar Sep 21 '22 17:09 github-actions[bot]

/not stale

Nuru avatar Sep 23 '22 08:09 Nuru

@jayanthvn how do we get this marked as "not stale"?

Nuru avatar Sep 23 '22 18:09 Nuru

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions[bot] avatar Nov 24 '22 00:11 github-actions[bot]

/not stale

jayanthvn avatar Nov 24 '22 00:11 jayanthvn