amazon-vpc-cni-k8s
Dangling ENIs without any association with Instances
What happened:
During an incident in which pods were failing due to IP address exhaustion, we noticed a large number of ENIs that were allocated but not attached to any instance. Our first assumption was that these were ENIs created to maintain the warm pool on the nodes, but on checking them we found that none carried the node.k8s.amazonaws.com/instance_id tag, which doesn't seem like expected behaviour.
https://github.com/aws/amazon-vpc-cni-k8s/blob/9db2ae62ecd0cb56f7fc20b80427fa6ff4e17a42/pkg/awsutils/awsutils.go#L606
As far as I can see, allocation and attachment of ENIs are done together in the code linked above, so there shouldn't be a case where ENIs are allocated but not attached and missing tags, except when the ENI attach and the follow-up delete both fail (https://github.com/aws/amazon-vpc-cni-k8s/blob/9db2ae62ecd0cb56f7fc20b80427fa6ff4e17a42/pkg/awsutils/awsutils.go#L616). To verify this, I checked the Prometheus metrics for the AttachNetworkInterface API for errors, but there is no significant increase that would explain the growth in allocated ENIs.
Hi @Buffer0x7cd
Do you have short-lived instances or clusters? Also, do you have any node termination policy? There is one known issue (https://github.com/aws/amazon-vpc-cni-k8s/issues/1223): after an ENI is detached, it takes a few seconds for the ENI to be deleted, and if the node is terminated in the meantime the ENI is left dangling in the account.
Hi @jayanthvn
It doesn't seem like this is the issue. https://github.com/aws/amazon-vpc-cni-k8s/blob/9db2ae62ecd0cb56f7fc20b80427fa6ff4e17a42/pkg/awsutils/awsutils.go#L836
From my understanding, in the case linked here the ENI is first detached and then deleted. Assuming the ENI was attached at some point, it should still have the node.k8s.amazonaws.com/instance_id tag even after being detached, as there is no step that deletes tags in the freeENI method.
In our case the dangling ENIs have no node.k8s.amazonaws.com/instance_id tag, which should be present if these dangling ENIs were caused by https://github.com/aws/amazon-vpc-cni-k8s/issues/1223
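The tag reasoning above can be sketched as a small filter. This is only an illustration, assuming ENI records shaped like the entries returned by EC2's DescribeNetworkInterfaces (`NetworkInterfaceId`, `Status`, `TagSet`); the function name `classify_dangling` and the two bucket names are hypothetical and are not part of the CNI plugin.

```python
INSTANCE_ID_TAG = "node.k8s.amazonaws.com/instance_id"


def tags_of(eni):
    """Flatten the EC2-style TagSet list into a plain dict."""
    return {t["Key"]: t["Value"] for t in eni.get("TagSet", [])}


def classify_dangling(enis):
    """Split unattached ENIs by whether they carry the instance_id tag.

    'detached_after_attach' matches the pattern described in issue 1223
    (tag still present after detach); 'never_attached' matches the ENIs
    reported in this issue (tag missing).
    """
    result = {"detached_after_attach": [], "never_attached": []}
    for eni in enis:
        # Only ENIs in the 'available' state are unattached.
        if eni.get("Status") != "available":
            continue
        if INSTANCE_ID_TAG in tags_of(eni):
            result["detached_after_attach"].append(eni["NetworkInterfaceId"])
        else:
            result["never_attached"].append(eni["NetworkInterfaceId"])
    return result
```

The input could come from something like boto3's `describe_network_interfaces` filtered on `status=available`. Other AWS services can also leave ENIs in the available state, so in practice the list would probably need narrowing to ENIs created by the CNI first.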
Yeah, that makes sense. I quickly ran a test: I detached an ENI and still see the instance_id tag even though the ENI is detached. Can you please open a support case?
Hi @Buffer0x7cd
For the ENI, do you see the "node.k8s.amazonaws.com/createdAt" tag present?
@jayanthvn yes, I can see the node.k8s.amazonaws.com/createdAt tag present
Thanks for checking @Buffer0x7cd. So it looks like createENI is fine, but if attachENI had failed we would have deleted the ENI: https://github.com/aws/amazon-vpc-cni-k8s/blob/9db2ae62ecd0cb56f7fc20b80427fa6ff4e17a42/pkg/awsutils/awsutils.go#L612-L614. If you can open a support case, we can check the EC2 logs to confirm why attachENI failed.
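The create, attach, delete-on-failure flow discussed in this thread can be modelled with a toy example. This is a simplified sketch, not the plugin's Go code: `FakeEC2`, `alloc_eni` and `dangling` are hypothetical names, and the in-memory class merely stands in for the EC2 API so that the failure cases can be exercised.

```python
class FakeEC2:
    """Hypothetical in-memory stand-in for EC2; not a real AWS client."""

    def __init__(self, fail_attach=False, fail_delete=False):
        self.fail_attach = fail_attach
        self.fail_delete = fail_delete
        self.enis = {}  # eni_id -> attached instance id, or None

    def create_eni(self):
        eni_id = f"eni-{len(self.enis) + 1}"
        self.enis[eni_id] = None
        return eni_id

    def attach_eni(self, eni_id, instance_id):
        if self.fail_attach:
            raise RuntimeError("attach failed")
        self.enis[eni_id] = instance_id

    def delete_eni(self, eni_id):
        if self.fail_delete:
            raise RuntimeError("delete failed")
        del self.enis[eni_id]


def alloc_eni(ec2, instance_id):
    """Create an ENI, attach it, and clean up if the attach fails."""
    eni_id = ec2.create_eni()
    try:
        ec2.attach_eni(eni_id, instance_id)
    except RuntimeError:
        try:
            ec2.delete_eni(eni_id)
        except RuntimeError:
            pass  # cleanup failed too: the ENI stays allocated but unattached
        return None
    return eni_id


def dangling(ec2):
    """ENIs that exist but are not attached to any instance."""
    return [eni_id for eni_id, inst in ec2.enis.items() if inst is None]
```

In this model a failed attach alone leaves nothing behind. An ENI is left dangling only when the cleanup delete fails as well, which is the single leak path identified earlier in the thread.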
We've noticed this while working on https://github.com/weaveworks/eksctl/ too. We recently managed to reproduce this issue: https://github.com/weaveworks/eksctl/issues/4214#issuecomment-923871267
We're seeing a similar/related issue but have cases where none of the active pods have ENIs that are attached to instances (the node has 2 ENIs with 10 and 1 private IP addresses respectively, and there are 13 pods on the node none of which use those ENIs). Not sure if this is actually the same issue but we've raised a support ticket (~~9328577341~~ 9331293811). The original reason we raised the ticket was due to pods getting stuck in Pending with events like:
Warning FailedScheduling 21s (x12 over 13m) default-scheduler 0/10 nodes are available: 4 node(s) didn't match node selector, 6 Insufficient vpc.amazonaws.com/pod-eni.
And further investigation led us to this issue, but it's unclear whether the issues are related.
Same issue running v1.7.5-eksbuild.1 on v1.21.5-eks-9017834.
We have many unused ENI interfaces with just the node.k8s.amazonaws.com/createdAt tag set.
This is pretty important since it can lead to available interface exhaustion causing service disruption.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days
Not stale
@aclevername - in the issue you mentioned we do see the node.k8s.amazonaws.com/instance_id tag. Typically this happens when the node is terminated between the detach and delete ENI calls.
@bryantbiggs or @GaruGaru - Can one of you please share IPAMD logs? You can email the log bundle to - [email protected]
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days
Not stale
Tagging teammate @vidhyadharm about this "dangling ENI" issue, suggested by @bryantbiggs as root cause for our vpc deletion issue in eks blueprints and the corresponding vpc deletion issue in aws vpc module.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days
/not stale