amazon-vpc-cni-k8s
IPAMD fails to start
What happened: IPAMD fails to start with an iptables error. The aws-node pods fail to start and prevent worker nodes from becoming Ready. This started occurring after updating to Rocky Linux 8.5, which is based on RHEL 8.5.
/var/log/aws-routed-eni/ipamd.log
{"level":"error","ts":"2022-02-04T14:38:08.239Z","caller":"networkutils/network.go:385","msg":"ipt.NewChain error for chain [AWS-SNAT-CHAIN-0]: running [/usr/sbin/iptables -t nat -N AWS-SNAT-CHAIN-0 --wait]: exit status 3: iptables v1.8.4 (legacy): can't initialize iptables table `nat': Table does not exist (do you need to insmod?)\nPerhaps iptables or your kernel needs to be upgraded.\n"}
Pod logs (kubectl logs -n kube-system aws-node-9tqb6):
{"level":"info","ts":"2022-02-04T15:11:48.035Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-02-04T15:11:48.036Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-02-04T15:11:48.062Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-02-04T15:11:48.071Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2022-02-04T15:11:50.092Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-02-04T15:11:52.103Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-02-04T15:11:54.115Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-02-04T15:11:56.124Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
What you expected to happen: ipamd to start normally.
How to reproduce it (as minimally and precisely as possible): Deploy an EKS cluster with an AMI based on Rocky Linux 8.5. In theory, any RHEL 8.5-based AMI could have this problem.
Anything else we need to know?: Running the iptables command from the ipamd log as root on the worker node works fine.
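For reference, this is the exact command from the ipamd log that succeeds when run as root directly on the node:
# Run as root on the worker node; this is the same chain creation ipamd attempts
/usr/sbin/iptables -t nat -N AWS-SNAT-CHAIN-0 --wait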
Environment:
- Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.15-eks-9c63c4", GitCommit:"9c63c4037a56f9cad887ee76d55142abd4155179", GitTreeState:"clean", BuildDate:"2021-10-20T00:21:03Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
- CNI: 1.10.1
- OS (cat /etc/os-release): NAME="Rocky Linux" VERSION="8.5 (Green Obsidian)" ID="rocky" ID_LIKE="rhel centos fedora" VERSION_ID="8.5" PLATFORM_ID="platform:el8" PRETTY_NAME="Rocky Linux 8.5 (Green Obsidian)" ANSI_COLOR="0;32" CPE_NAME="cpe:/o:rocky:rocky:8:GA" HOME_URL="https://rockylinux.org/" BUG_REPORT_URL="https://bugs.rockylinux.org/" ROCKY_SUPPORT_PRODUCT="Rocky Linux" ROCKY_SUPPORT_PRODUCT_VERSION="8"
- Kernel (uname -a): Linux ip-10-2--xx-xxx.ec2.xxxxxxxx.com 4.18.0-348.12.2.el8_5.x86_64 #1 SMP Wed Jan 19 17:53:40 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
We found that loading the ip_tables, iptable_nat, and iptable_mangle kernel modules fixes the issue: modprobe ip_tables iptable_nat iptable_mangle
Still trying to figure out why these modules were loaded by default in 8.4 and not in 8.5. Also still not sure why the same iptables commands work without these modules directly on the worker instance but not in the container.
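A minimal sketch of making that workaround persistent across reboots, assuming a systemd-based host (the modules-load.d file name is arbitrary):
# Load the legacy iptables modules now (the workaround above)
modprobe ip_tables iptable_nat iptable_mangle
# Persist the modules across reboots via systemd-modules-load
cat <<'EOF' > /etc/modules-load.d/iptables.conf
ip_tables
iptable_nat
iptable_mangle
EOF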
We do install iptables by default in the aws-node container images. It would be good to check the changelog between 8.4 and 8.5 for any insight into the observed behavior.
@grumpymatt I have been getting the same issue while setting up EKS on RHEL 8.5, and after loading the kernel modules it works fine. The strange thing is that I tried the same on RHEL 8.0 worker nodes and still got the same issue. It works fine on RHEL 7.x, though.
@grumpymatt Since the issue is clearly tied to missing iptables modules, I think we can close this issue. Let us know if there is any other concern.
@vishal0nhce Yeah, the iptables module is required for the VPC CNI, and I'm not sure why it is missing in RHEL 8.5. I don't see any specific call-out for RHEL 8.5 around this.
We found an alternative way of fixing it: updating iptables inside the CNI container image with the following Dockerfile.
FROM 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.10.1
RUN yum install iptables-nft -y
RUN cd /usr/sbin && rm iptables && ln -s xtables-nft-multi iptables
My concern is that the direction of RHEL and downstream distros seems to be away from iptables-legacy and toward iptables-nft. Are there any plans to address this in the CNI container image?
Interesting. So, RHEL 8 doesn't support iptables-legacy anymore? That explains the issue. I think iptables legacy mode is sort of the default (at least for now) for most distributions, and in particular Amazon Linux 2 images use iptables-legacy by default as well. We will track AL2 images for our default CNI builds. Will check and update if there is something we can do to address this scenario.
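For anyone hitting this, a quick way to see which backend a given iptables binary uses, both on the host and inside the aws-node container (pod name taken from the logs above):
# Prints "(legacy)" or "(nf_tables)" in the version string
iptables --version
# Same check inside the CNI container
kubectl -n kube-system exec aws-node-9tqb6 -c aws-node -- iptables --version
# Are the legacy backend's kernel modules loaded on the host?
lsmod | grep -E 'ip_tables|iptable_nat|iptable_mangle'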
We are seeing a similar situation where IPAM-D won't start successfully and the aws-node pod restarts at least once. We are running EKS 1.20.
$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
@bilby91 - Can you please check if kube-proxy is taking time to start? kube-proxy should set up the rules for aws-node to reach the API server on startup.
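A quick way to verify that (standard EKS kube-proxy labels assumed):
# Is kube-proxy running and ready on the affected node?
kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide
# Any errors while it programs the service rules?
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=50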
Similar error of IPAMD failing to start with the latest version, v1.11.0. kube-proxy is already running successfully. The only change was the VPC CNI image update from 1.9.0 to 1.11.0. Any clue what's wrong with the latest version? TIA
{"level":"info","ts":"2022-04-21T19:44:43.569Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
Similar error of IPAMD failing to start with latest version v1.11.0. Kube-proxy is already running successfully.
{"level":"info","ts":"2022-04-27T10:07:56.670Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
I was seeing this error. In my case, a developer had manually created VPC endpoints for a few services, including STS, resulting in traffic to those services being blackholed, so ipamd could not create a session to collect the information it needed.
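If you suspect the same thing, a rough connectivity check from the affected node (region and endpoint are illustrative; a misrouted interface endpoint typically shows up as a hang or timeout here):
# Does the STS endpoint resolve and answer at all?
nslookup sts.us-east-1.amazonaws.com
curl -sS -m 5 -o /dev/null -w '%{http_code}\n' https://sts.us-east-1.amazonaws.com/
# With the AWS CLI installed, this exercises the credential flow end to end
aws sts get-caller-identity --region us-east-1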
I am also facing the same issue while trying to upgrade the cluster from 1.19 to 1.20 in EKS. Can't pinpoint the exact problem.
@dhavaln-able and @kathy-lee - So with v1.11.0, is aws-node continuously crashing, or is it coming up after a few restarts?
@sahil100122 - You mean that while upgrading, kube-proxy is up and running but ipamd is not starting at all?
I found that FlatCar CoreOS also encounters a related issue: the iptables command in FlatCar CoreOS version 3033.2.0 uses the nftables kernel backend instead of the legacy iptables backend, which means pods on a secondary ENI cannot reach internal Kubernetes ClusterIP services.
Thanks to @grumpymatt's workaround: after building a customized amazon-k8s-cni container image the same way, the AWS VPC CNI now works for me on FlatCar CoreOS versions greater than 3033.2.0.
Had the same issue while upgrading, but after looking at the troubleshooting guide and patching the daemonset with the following, aws-node came up as expected and without issues.
# New env vars introduced with 1.10.x
- op: add
path: "/spec/template/spec/initContainers/0/env/-"
value: {"name": "ENABLE_IPv6", "value": "false"}
- op: add
path: "/spec/template/spec/containers/0/env/-"
value: {"name": "ENABLE_IPv4", "value": "true"}
- op: add
path: "/spec/template/spec/containers/0/env/-"
value: {"name": "ENABLE_IPv6", "value": "false"}
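For reference, a JSON-6902 patch like the above can be applied with kubectl; a single-op sketch (the remaining entries follow the same pattern):
kubectl -n kube-system patch daemonset aws-node --type json \
  -p '[{"op":"add","path":"/spec/template/spec/containers/0/env/-","value":{"name":"ENABLE_IPv4","value":"true"}}]'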
I also faced the above issue, but in my case I was using a custom kube-proxy image. When I reverted to the default kube-proxy image and restarted the aws-node pods, everything worked fine.
Why does aws-node/ipamd not give any error related to communication if the issue is with kube-proxy 🤔
Had a similar issue yesterday. AWS Systems Manager applied a patch to all of our nodes, which required a reboot of the instances. All instances came up healthy, but on three out of five the network was not working, basically making the cluster unusable. Investigation led me to issues like the one here or this AWS Knowledge Center entry.
Recycling all nodes resolved the issue. Did not try to just terminate the aws-node pods. Interestingly, only one out of three clusters was affected, so it is probably difficult to reproduce.
What I also noticed: why is aws-node mounting /var/run/dockershim.sock even though we use containerd?
- AWS Node Image:
602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon-k8s-cni:v1.10.1-eksbuild.1
- Default kube-proxy, default aws-node, etc.
Hey all 👋🏼 please be aware that this failure mode also happens when the IPs in a subnet are exhausted.
I just faced this and noticed I had misconfigured my worker groups to use a small subnet (/26) instead of the bigger one I intended to use (/18).
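A quick way to check how much room a subnet has left (the subnet ID below is a placeholder):
# AvailableIpAddressCount approaching 0 means ipamd cannot get new IPs
aws ec2 describe-subnets --subnet-ids subnet-0123456789abcdef0 \
  --query 'Subnets[].{ID:SubnetId,CIDR:CidrBlock,Free:AvailableIpAddressCount}' --output table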
Also: check that you have the right security group attached to your nodes.
For those coming here after upgrading EKS, try re-applying the VPC CNI manifest file, for example: kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.11.3/config/master/aws-k8s-cni.yaml
For me, the issue was the IAM policy policy/AmazonEKS_CNI_Policy-2022092909143815010000000b.
My policy only allowed IPv6, like below:
{
"Statement": [
{
"Action": [
"ec2:DescribeTags",
"ec2:DescribeNetworkInterfaces",
"ec2:DescribeInstances",
"ec2:DescribeInstanceTypes",
"ec2:AssignIpv6Addresses"
],
"Effect": "Allow",
"Resource": "*",
"Sid": "IPV6"
},
{
"Action": "ec2:CreateTags",
"Effect": "Allow",
"Resource": "arn:aws:ec2:*:*:network-interface/*",
"Sid": "CreateTags"
}
],
"Version": "2012-10-17"
}
I changed the policy like below:
{
"Statement": [
{
"Action": [
"ec2:UnassignPrivateIpAddresses",
"ec2:ModifyNetworkInterfaceAttribute",
"ec2:DetachNetworkInterface",
"ec2:DescribeTags",
"ec2:DescribeNetworkInterfaces",
"ec2:DescribeInstances",
"ec2:DescribeInstanceTypes",
"ec2:DeleteNetworkInterface",
"ec2:CreateNetworkInterface",
"ec2:AttachNetworkInterface",
"ec2:AssignPrivateIpAddresses"
],
"Effect": "Allow",
"Resource": "*",
"Sid": "IPV4"
},
{
"Action": [
"ec2:DescribeTags",
"ec2:DescribeNetworkInterfaces",
"ec2:DescribeInstances",
"ec2:DescribeInstanceTypes",
"ec2:AssignIpv6Addresses"
],
"Effect": "Allow",
"Resource": "*",
"Sid": "IPV6"
},
{
"Action": "ec2:CreateTags",
"Effect": "Allow",
"Resource": "arn:aws:ec2:*:*:network-interface/*",
"Sid": "CreateTags"
}
],
"Version": "2012-10-17"
}
and it works! 😅
I've had the same problem for the past two weeks; has someone found a solution?
Can you please share the last few lines of ipamd logs before aws node restarts?
ipamd log:
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.43.0"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.43.0/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.60.1"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.60.1/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.47.2"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.47.2/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.46.131"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.46.131/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.61.196"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.61.196/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.49.6"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.49.6/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.41.135"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.41.135/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.38.218"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.38.218/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.39.157"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.39.157/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.59.213"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.59.213/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"debug","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:653","msg":"Reconcile existing ENI eni-00023922abf62516c IP prefixes"}
{"level":"debug","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1351","msg":"Found prefix pool count 0 for eni eni-00023922abf62516c\n"}
{"level":"debug","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:653","msg":"Successfully Reconciled ENI/IP pool"}
{"level":"debug","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1396","msg":"IP pool stats: Total IPs/Prefixes = 87/0, AssignedIPs/CooldownIPs: 31/0, c.maxIPsPerENI = 29"}
command terminated with exit code 137
aws-node:
# kubectl logs -f aws-node-zdp6x --tail 30 -n kube-system
{"level":"info","ts":"2022-10-02T14:56:07.820Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-10-02T14:56:07.821Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-10-02T14:56:07.833Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-10-02T14:56:07.834Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2022-10-02T14:56:09.841Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:11.847Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:13.853Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:15.860Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:17.866Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:19.872Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:21.878Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:23.884Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:25.890Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:27.897Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:29.903Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:31.909Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:33.916Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:35.922Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:37.928Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:39.934Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:41.940Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:43.947Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:45.953Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:47.959Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:49.966Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
Event screenshots:
I used cluster-autoscaler for auto-scaling; the k8s version is 1.22. I also followed the troubleshooting guide https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md#known-issues and applied the suggestion:
kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.11.4/config/master/aws-k8s-cni.yaml
Interestingly, this failure usually only occurs on a certain node; when I terminate that node's instance and let it scale back up automatically, it starts working.
But after running for a while, it restarts again.
I am having the same issue applying the v1.11.4 update. For those trying to apply v1.11.3 or v1.11.4, make sure to substitute EKS_CLUSTER_NAME and VPC_ID with the proper values; at least in my case, it didn't work otherwise.
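A minimal sketch of that substitution, assuming the placeholder tokens in the downloaded manifest are literally EKS_CLUSTER_NAME and VPC_ID as described above (the cluster name and VPC ID below are made up):
curl -sL https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.11.4/config/master/aws-k8s-cni.yaml \
  | sed -e 's/EKS_CLUSTER_NAME/my-cluster/' -e 's/VPC_ID/vpc-0123456789abcdef0/' \
  | kubectl apply -f -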
I've had the same problem recently, has someone found a solution?
@ermiaqasemi From this tutorial I chose to attach the AmazonEKS_CNI_Policy to the aws-node service account and I was getting the error. I decided to try simply attaching it to the AmazonEKSNodeRole, which apparently is the less recommended way to do it, but it works.
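For reference, attaching the AWS-managed policy to the node role can be done like this (the role name follows the comment above; yours may differ):
aws iam attach-role-policy \
  --role-name AmazonEKSNodeRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy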
@itay-grudev Thanks for sharing, but in my case I don't think it's related to the node IAM roles, since the role is already attached to the nodes!
@ermiaqasemi Did you also try a lower EKS version? Some people specifically reported problems with v1.11.4. I created a new cluster with 1.10.4, which is finally working.
By EKS version you meant the CNI version, right? My EKS version is 1.22 and my CNI version is 1.10.2-eksbuild.1. I didn't have this problem with EKS 1.21. @itay-grudev