amazon-vpc-cni-k8s Dangling ENIs without any association with Instances

What happened: During an incident in which pods were failing due to IP address exhaustion, we noticed a lot of ENIs that were allocated but not attached to any instance. Our first assumption was that these were ENIs created to maintain the warm pool on the nodes, but after checking them we discovered that they had no node.k8s.amazonaws.com/instance_id tag, which doesn't seem like expected behaviour. https://github.com/aws/amazon-vpc-cni-k8s/blob/9db2ae62ecd0cb56f7fc20b80427fa6ff4e17a42/pkg/awsutils/awsutils.go#L606 As far as I can see, allocation and attachment of ENIs happen together, so there shouldn't be a case where ENIs are allocated but not attached and missing tags, except here (https://github.com/aws/amazon-vpc-cni-k8s/blob/9db2ae62ecd0cb56f7fc20b80427fa6ff4e17a42/pkg/awsutils/awsutils.go#L616, where ENI attach and delete both failed). To verify this I checked the Prometheus metrics for the AttachNetworkInterface API for errors, but there is no significant increase that would explain the growth in allocated ENIs.
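For anyone wanting to list the ENIs described above, one option is to filter EC2 `DescribeNetworkInterfaces` output for interfaces in the `available` state that lack the node.k8s.amazonaws.com/instance_id tag. This is only a minimal sketch, not the CNI's own logic: it assumes records shaped like the `NetworkInterfaces` entries returned by boto3's `describe_network_interfaces`, and the helper name `find_untagged_dangling` and the sample data are hypothetical.

```python
INSTANCE_ID_TAG = "node.k8s.amazonaws.com/instance_id"
CREATED_AT_TAG = "node.k8s.amazonaws.com/createdAt"


def tags_of(eni):
    """Flatten the EC2-style TagSet list into a plain dict."""
    return {t["Key"]: t["Value"] for t in eni.get("TagSet", [])}


def find_untagged_dangling(enis):
    """Return IDs of ENIs that are unattached and have no instance_id tag.

    `enis` is assumed to look like the `NetworkInterfaces` list from
    boto3's ec2.describe_network_interfaces() response.
    """
    return [
        eni["NetworkInterfaceId"]
        for eni in enis
        if eni.get("Status") == "available"  # not attached to any instance
        and INSTANCE_ID_TAG not in tags_of(eni)
    ]


# Hypothetical sample data, for illustration only.
sample = [
    {   # attached and tagged: healthy
        "NetworkInterfaceId": "eni-aaa",
        "Status": "in-use",
        "TagSet": [{"Key": INSTANCE_ID_TAG, "Value": "i-111"}],
    },
    {   # unattached, only createdAt: the case reported in this issue
        "NetworkInterfaceId": "eni-bbb",
        "Status": "available",
        "TagSet": [{"Key": CREATED_AT_TAG, "Value": "placeholder"}],
    },
    {   # unattached but still carrying instance_id: not the case reported here
        "NetworkInterfaceId": "eni-ccc",
        "Status": "available",
        "TagSet": [{"Key": INSTANCE_ID_TAG, "Value": "i-222"}],
    },
]

print(find_untagged_dangling(sample))
```

Against a real account, the input list would come from the EC2 API rather than from hard-coded sample data.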

Apr 29 '21 17:04 Buffer0x7cd

Hi @Buffer0x7cd

Do you have short-lived instances or clusters? Also, do you have any node termination policy? There is one known issue (https://github.com/aws/amazon-vpc-cni-k8s/issues/1223): after an ENI is detached, it takes a few seconds for the ENI to be deleted, and if the node is terminated in the meantime, the ENI will be left dangling in the account.

Apr 29 '21 17:04 jayanthvn

Hi @jayanthvn

That doesn't seem to be the issue. https://github.com/aws/amazon-vpc-cni-k8s/blob/9db2ae62ecd0cb56f7fc20b80427fa6ff4e17a42/pkg/awsutils/awsutils.go#L836 From my understanding, in that case the ENI is first detached and then deleted. Assuming the ENI was attached at some point, it should still have the node.k8s.amazonaws.com/instance_id tag even after being detached (as there is no step that deletes tags in the freeENI method).

In our case the dangling ENIs have no node.k8s.amazonaws.com/instance_id tag, which should be present if these dangling ENIs were caused by https://github.com/aws/amazon-vpc-cni-k8s/issues/1223

Apr 29 '21 18:04 Buffer0x7cd

Yeah, makes sense. I quickly ran a test and detached an ENI, and I still see the instance_id tag even though the ENI is detached. Can you please open a support case?

Apr 29 '21 18:04 jayanthvn

Hi @Buffer0x7cd

For the ENI, do you see the "node.k8s.amazonaws.com/createdAt" tag present?

May 18 '21 18:05 jayanthvn

@jayanthvn yes, I can see the node.k8s.amazonaws.com/createdAt tag present

Jul 13 '21 09:07 Buffer0x7cd

Thanks for checking @Buffer0x7cd. So it looks like createENI is fine, but if attachENI had failed we would have deleted the ENI - https://github.com/aws/amazon-vpc-cni-k8s/blob/9db2ae62ecd0cb56f7fc20b80427fa6ff4e17a42/pkg/awsutils/awsutils.go#L612-L614. If you can open a support case, we can check EC2 logs to confirm why attachENI failed.
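The create, attach, delete-on-failure sequence discussed in this thread can be modelled with a small simulation. This is a rough sketch under stated assumptions, not the code in awsutils.go: `FakeEC2` and `alloc_eni` are hypothetical names, the fake API is in-memory only, and it assumes (as the comments above suggest) that the createdAt tag is applied when the ENI is created. It shows that in this simplified model an ENI is left behind only when both the attach and the follow-up delete fail.

```python
CREATED_AT_TAG = "node.k8s.amazonaws.com/createdAt"


class FakeEC2:
    """Hypothetical in-memory stand-in for the EC2 API, for illustration only."""

    def __init__(self, attach_fails=False, delete_fails=False):
        self.attach_fails = attach_fails
        self.delete_fails = delete_fails
        self.enis = {}  # eni_id -> {"attached": bool, "tags": dict}

    def create(self, eni_id):
        # Assumption: createdAt is set at creation time.
        self.enis[eni_id] = {
            "attached": False,
            "tags": {CREATED_AT_TAG: "placeholder"},
        }

    def attach(self, eni_id, instance_id):
        if self.attach_fails:
            raise RuntimeError("attach failed")
        self.enis[eni_id]["attached"] = True

    def delete(self, eni_id):
        if self.delete_fails:
            raise RuntimeError("delete failed")
        del self.enis[eni_id]


def alloc_eni(ec2, eni_id, instance_id):
    """Create, then attach; if attach fails, try to delete the new ENI."""
    ec2.create(eni_id)
    try:
        ec2.attach(eni_id, instance_id)
    except RuntimeError:
        try:
            ec2.delete(eni_id)
        except RuntimeError:
            pass  # ENI is leaked: created, unattached, never cleaned up
        return False
    return True


ok = FakeEC2()
cleaned = FakeEC2(attach_fails=True)
leaked = FakeEC2(attach_fails=True, delete_fails=True)

print(alloc_eni(ok, "eni-1", "i-1"), ok.enis)
print(alloc_eni(cleaned, "eni-2", "i-1"), cleaned.enis)
print(alloc_eni(leaked, "eni-3", "i-1"), leaked.enis)
```

In the third scenario the leftover ENI carries only the createdAt tag and is not attached, which matches what was reported above. The simulation does not explain why either call would fail in practice.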

Aug 04 '21 21:08 jayanthvn

We've noticed this while working on https://github.com/weaveworks/eksctl/ too. We recently managed to reproduce this issue: https://github.com/weaveworks/eksctl/issues/4214#issuecomment-923871267

Sep 22 '21 15:09 aclevername

We're seeing a similar/related issue but have cases where none of the active pods have ENIs that are attached to instances (the node has 2 ENIs with 10 and 1 private IP addresses respectively, and there are 13 pods on the node none of which use those ENIs). Not sure if this is actually the same issue but we've raised a support ticket (~~9328577341~~ 9331293811). The original reason we raised the ticket was due to pods getting stuck in Pending with events like:

Warning FailedScheduling 21s (x12 over 13m) default-scheduler 0/10 nodes are available: 4 node(s) didn't match node selector, 6 Insufficient vpc.amazonaws.com/pod-eni.

And further investigation led us to this issue, but it's unclear whether the issues are related.

Dec 10 '21 04:12 hiattp

Same issue running v1.7.5-eksbuild.1 on v1.21.5-eks-9017834. We have many unused ENI interfaces with just the node.k8s.amazonaws.com/createdAt tag set. This is pretty important since it can lead to available interface exhaustion causing service disruption.
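To gauge how much subnet address space such unused ENIs are holding, one rough approach is to count the private IPs on unattached interfaces per subnet. The sketch below is an illustration only: it assumes records shaped like the `NetworkInterfaces` entries from boto3's `describe_network_interfaces` (with `Status`, `SubnetId` and `PrivateIpAddresses` fields), and the function name and sample data are hypothetical.

```python
from collections import Counter


def ips_held_by_unattached(enis):
    """Count private IPs held by unattached ENIs, grouped by subnet ID."""
    held = Counter()
    for eni in enis:
        if eni.get("Status") == "available":  # not attached to any instance
            held[eni["SubnetId"]] += len(eni.get("PrivateIpAddresses", []))
    return dict(held)


# Hypothetical sample data, for illustration only.
sample_enis = [
    {"NetworkInterfaceId": "eni-1", "Status": "available", "SubnetId": "subnet-a",
     "PrivateIpAddresses": [{"PrivateIpAddress": "10.0.0.10"}]},
    {"NetworkInterfaceId": "eni-2", "Status": "available", "SubnetId": "subnet-a",
     "PrivateIpAddresses": [{"PrivateIpAddress": "10.0.0.11"},
                            {"PrivateIpAddress": "10.0.0.12"}]},
    {"NetworkInterfaceId": "eni-3", "Status": "in-use", "SubnetId": "subnet-a",
     "PrivateIpAddresses": [{"PrivateIpAddress": "10.0.0.13"}]},
    {"NetworkInterfaceId": "eni-4", "Status": "available", "SubnetId": "subnet-b",
     "PrivateIpAddresses": [{"PrivateIpAddress": "10.0.1.10"}]},
]

print(ips_held_by_unattached(sample_enis))
```

Interfaces that are attached (`in-use`) are skipped, so the totals only reflect addresses tied up by unattached ENIs.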

Feb 09 '22 11:02 GaruGaru

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

Apr 14 '22 00:04 github-actions[bot]

Not stale

Apr 14 '22 00:04 bryantbiggs

@aclevername - in the issue you mentioned we do see the node.k8s.amazonaws.com/instance_id tag. Typically this happens when the node is terminated between the detach and delete ENI calls.

@bryantbiggs or @GaruGaru - Can one of you please share IPAMD logs? You can email the log bundle to - [email protected]

Apr 18 '22 23:04 jayanthvn

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

Jun 18 '22 00:06 github-actions[bot]

Not stale

Jun 18 '22 00:06 bryantbiggs

Tagging teammate @vidhyadharm about this "dangling ENI" issue, suggested by @bryantbiggs as root cause for our vpc deletion issue in eks blueprints and the corresponding vpc deletion issue in aws vpc module.

Sep 19 '22 15:09 timblaktu

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

Nov 19 '22 00:11 github-actions[bot]

/not stale

Nov 19 '22 01:11 jayanthvn
