unable to find ENI that matches IP -- device or resource busy on worker nodes (flannel 0.16.1)
We upgraded our cluster from 1.18 to 1.22 and accordingly upgraded flannel from 0.11.0 to 0.16.1. After refreshing the worker node instances, the flannel pods on those nodes go into a crash loop.
Expected Behavior
Flannel should keep running after the worker node instance refresh (AWS): it is a DaemonSet, and the same version (0.16.1) runs fine on the master nodes, which we upgraded but did not refresh.
Log from a master node, where registration succeeds:

```
I0323 20:37:54.198122 1 awsvpc.go:88] Backend configured as: {
"Type": "aws-vpc"
}
I0323 20:37:54.198192 1 kube.go:339] Setting NodeNetworkUnavailable
I0323 20:37:54.898119 1 awsvpc.go:322] Found eni-0e5d8f2c847f22757 that has 10.69.101.137 IP address.
I0323 20:37:55.269139 1 awsvpc.go:79] Route table configured: false
I0323 20:37:55.373826 1 awsvpc.go:141] Found route table rtb-063c27bad0d39875c.
```
Current Behavior
Flannel fails with `Error registering network: unable to find ENI that matches for IP`. The error message comes from the aws-vpc backend, which calls the `findENI` function in the flannel code.
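For context, the lookup that fails can be sketched as follows. This is a simplified, hypothetical rendition of what a `findENI`-style match does against the DescribeInstances response, not flannel's actual Go code; the field names and sample values are taken from the logs and `aws ec2 describe-instances` output in this report.

```python
import json

def find_eni(interfaces, node_ip):
    """Return the ID of the ENI that owns node_ip, or None if no ENI matches.

    Rough sketch of the match flannel's aws-vpc backend performs after
    fetching the instance's network interfaces from the EC2 API.
    """
    for eni in interfaces:
        for addr in eni.get("PrivateIpAddresses", []):
            if addr.get("PrivateIpAddress") == node_ip:
                return eni["NetworkInterfaceId"]
    return None

# Sample shaped like the describe-instances output shown later in this report,
# using the ENI/IP pair from the working master-node log:
sample = json.loads('''
[{"NetworkInterfaceId": "eni-0e5d8f2c847f22757",
  "PrivateIpAddresses": [{"Primary": true, "PrivateIpAddress": "10.69.101.137"}]}]
''')
print(find_eni(sample, "10.69.101.137"))  # -> eni-0e5d8f2c847f22757
```

The matching itself is trivial; as the failing log below shows, the problem is that the EC2 API call feeding it never succeeds.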
```
jenkins@ip-10-69-130-165:/home/ubuntu$ kubectl logs po/kube-flannel-ds-2dw7t -n kube-system -f
I0323 22:16:26.575521 1 main.go:218] CLI flags config: {etcdEndpoints:http://127.0.0.1:4001,http://127.0.0.1:2379 etcdPrefix:/coreos.com/network etcdKeyfile: etcdCertfile: etcdCAFile: etcdUsername: etcdPassword: help:false version:false autoDetectIPv4:false autoDetectIPv6:false kubeSubnetMgr:true kubeApiUrl: kubeAnnotationPrefix:flannel.alpha.coreos.com kubeConfigFile: iface:[] ifaceRegex:[] ipMasq:true subnetFile:/run/flannel/subnet.env subnetDir: publicIP: publicIPv6: subnetLeaseRenewMargin:60 healthzIP:0.0.0.0 healthzPort:0 charonExecutablePath: charonViciUri: iptablesResyncSeconds:5 iptablesForwardRules:true netConfPath:/etc/kube-flannel/net-conf.json setNodeNetworkUnavailable:true}
W0323 22:16:26.575592 1 client_config.go:608] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0323 22:16:26.774576 1 kube.go:120] Waiting 10m0s for node controller to sync
I0323 22:16:26.774741 1 kube.go:378] Starting kube subnet manager
I0323 22:16:27.774762 1 kube.go:127] Node controller sync successful
I0323 22:16:27.774797 1 main.go:238] Created subnet manager: Kubernetes Subnet Manager - ip-10-69-102-84.us-west-2.compute.internal
I0323 22:16:27.774814 1 main.go:241] Installing signal handlers
I0323 22:16:27.774946 1 main.go:460] Found network config - Backend type: aws-vpc
I0323 22:16:27.774968 1 main.go:652] Determining IP address of default interface
I0323 22:16:27.775263 1 main.go:699] Using interface with name ens5 and address 10.69.102.84
I0323 22:16:27.775279 1 main.go:721] Defaulting external address to interface address (10.69.102.84)
I0323 22:16:27.775284 1 main.go:734] Defaulting external v6 address to interface address (<nil>)
I0323 22:16:27.775310 1 awsvpc.go:88] Backend configured as: {
"Type": "aws-vpc"
}
I0323 22:16:27.775381 1 kube.go:339] Setting NodeNetworkUnavailable
E0323 22:16:29.127731 1 main.go:326] Error registering network: unable to find ENI that matches the 10.69.102.84 IP address. RequestError: send request failed
caused by: Post "https://ec2.us-west-2.amazonaws.com/": dial tcp: lookup ec2.us-west-2.amazonaws.com: device or resource busy
I0323 22:16:29.127782 1 main.go:440] Stopping shutdownHandler...
W0323 22:16:29.127857 1 reflector.go:424] github.com/flannel-io/flannel/subnet/kube/kube.go:379: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
jenkins@ip-10-69-130-165:/home/ubuntu$
```
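Note that despite the "unable to find ENI" wording, the underlying failure in the log is a DNS lookup error (`lookup ec2.us-west-2.amazonaws.com: device or resource busy`): the DescribeInstances request never reaches EC2 at all. A quick node-side resolution check can confirm this; the snippet below is a sketch, with the hostname taken from the log.

```python
import socket

def can_resolve(host, port=443):
    """Return True if host resolves with this node's resolver configuration."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:
        return False

# Prints False on a node whose resolver setup is broken the way the log shows.
print(can_resolve("ec2.us-west-2.amazonaws.com"))
```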
The flannel pods are crashing on those worker nodes:
```
kube-system kube-flannel-ds-2dw7t 0/1 CrashLoopBackOff 19 (4m47s ago) 78m 10.69.102.84 ip-10-69-102-84.us-west-2.compute.internal <none> <none>
kube-system kube-flannel-ds-5p78m 0/1 CrashLoopBackOff 27 (3m12s ago) 117m 10.69.100.135 ip-10-69-100-135.us-west-2.compute.internal <none> <none>
```
ConfigMap for flannel:
```
jenkins@ip-10-69-130-165:/home/ubuntu$ kubectl describe cm/kube-flannel-cfg -n kube-system
Name:         kube-flannel-cfg
Namespace:    kube-system
Labels:       app=flannel
              tier=node
Annotations:  <none>

Data
====
cni-conf.json:
----
{
  "name": "cbr0",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "flannel",
      "delegate": {
        "hairpinMode": true,
        "isDefaultGateway": true
      }
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      }
    }
  ]
}

net-conf.json:
----
{
  "Network": "10.2.0.0/16",
  "Backend": {
    "Type": "aws-vpc"
  }
}

BinaryData
====

Events:  <none>
jenkins@ip-10-69-130-165:/home/ubuntu$
```
ClusterRole and ClusterRoleBinding:
```
jenkins@ip-10-69-130-165:/home/ubuntu$ kubectl describe clusterrole.rbac.authorization.k8s.io/flannel
Name:         flannel
Labels:       <none>
Annotations:  <none>
PolicyRule:
  Resources                       Non-Resource URLs  Resource Names             Verbs
  ---------                       -----------------  --------------             -----
  pods                            []                 []                         [get]
  nodes                           []                 []                         [list watch]
  nodes/status                    []                 []                         [patch]
  podsecuritypolicies.extensions  []                 [psp.flannel.unprivileged] [use]
jenkins@ip-10-69-130-165:/home/ubuntu$ kubectl describe clusterrolebinding.rbac.authorization.k8s.io/flannel
Name:         flannel
Labels:       <none>
Annotations:  <none>
Role:
  Kind:  ClusterRole
  Name:  flannel
Subjects:
  Kind            Name     Namespace
  ----            ----     ---------
  ServiceAccount  flannel  kube-system
jenkins@ip-10-69-130-165:/home/ubuntu$
```
The IAM policy attached to the worker node role includes:
```
"Action": [
  "ec2:Describe*",
  "ec2:AttachVolume",
  "ec2:DetachVolume",
  "ec2:CreateRoute",
  "ec2:DeleteRoute",
  "ec2:ReplaceRoute",
  "ec2:ModifyInstanceAttribute",
  "ec2:ModifyNetworkInterfaceAttribute",
  "ec2:CreateNetworkInterface",
  "ec2:AttachNetworkInterface",
  "ec2:DeleteNetworkInterface",
  "ec2:DetachNetworkInterface",
  "ec2:AssignPrivateIpAddresses",
  "ec2:UnassignPrivateIpAddresses",
  "ec2:CreateSnapshot",
  "ec2:CreateSnapshots",
  "ec2:CreateTags"
],
"Effect": "Allow",
"Resource": [
  "*"
],
"Sid": "Ec2Restrictions"
```
I have also run an EC2 describe command from the same node where flannel is crashing, and the ENI does list the node's IP:
```
root@ip-10-69-100-135:~# aws ec2 describe-instances --instance-ids i-XXXXXXX | jq '.Reservations[0].Instances[0].NetworkInterfaces'
[
  {
    "Attachment": {
      "AttachTime": "2022-03-23T19:59:04+00:00",
      "AttachmentId": "eni-attach-XXXXX",
      "DeleteOnTermination": true,
      "DeviceIndex": 0,
      "Status": "attached",
      "NetworkCardIndex": 0
    },
    "Description": "",
    "Groups": [
      {
        "GroupName": "psupgrade1_private_sg",
        "GroupId": "XXXXX"
      }
    ],
    "Ipv6Addresses": [],
    "MacAddress": "XXXX",
    "NetworkInterfaceId": "eni-XXXXXXXXX",
    "OwnerId": "478147420456",
    "PrivateDnsName": "ip-10-69-100-135.us-west-2.compute.internal",
    "PrivateIpAddress": "10.69.100.135",
    "PrivateIpAddresses": [
      {
        "Primary": true,
        "PrivateDnsName": "ip-10-69-100-135.us-west-2.compute.internal",
        "PrivateIpAddress": "10.69.100.135"
      }
    ],
    "SourceDestCheck": true,
    "Status": "in-use",
    "SubnetId": "XXXX",
    "VpcId": "XXXXX",
    "InterfaceType": "interface"
  }
]
root@ip-10-69-100-135:~#
```
## Your Environment
* Flannel version: 0.16.1
* Backend used (e.g. vxlan or udp): aws-vpc
* Etcd version: 3.5.1
* Kubernetes version (if used): 1.22.6
* Operating System and version: Ubuntu 18.04
One important thing we noticed: flannel is unable to create the `cni0` bridge and `vethXXX` interfaces on the bad nodes (workers), while they exist on the good nodes (masters).
BAD nodes (worker nodes):

```
root@ip-10-69-100-135:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
    link/ether 02:78:37:ec:06:f7 brd ff:ff:ff:ff:ff:ff
    inet 10.69.100.135/24 brd 10.69.100.255 scope global dynamic ens5
       valid_lft 3158sec preferred_lft 3158sec
    inet6 fe80::78:37ff:feec:6f7/64 scope link
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:1d:79:20:df brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:1dff:fe79:20df/64 scope link
       valid_lft forever preferred_lft forever
4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8951 qdisc noqueue state UNKNOWN group default
    link/ether 5e:ea:09:1e:05:38 brd ff:ff:ff:ff:ff:ff
    inet 10.2.3.0/32 brd 10.2.3.0 scope global flannel.1
       valid_lft forever preferred_lft forever
    inet6 fe80::5cea:9ff:fe1e:538/64 scope link
       valid_lft forever preferred_lft forever
root@ip-10-69-100-135:~#
```
Good nodes (master nodes):

```
4: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default qlen 1000
    link/ether d6:2e:f1:1c:9c:b3 brd ff:ff:ff:ff:ff:ff
    inet 10.2.214.1/24 brd 10.2.214.255 scope global cni0
       valid_lft forever preferred_lft forever
    inet6 fe80::d42e:f1ff:fe1c:9cb3/64 scope link
       valid_lft forever preferred_lft forever
6: veth7cc74bbc@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master cni0 state UP group default
    link/ether a2:eb:9f:cc:39:8b brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::a0eb:9fff:fecc:398b/64 scope link
       valid_lft forever preferred_lft forever
7: vethbc56949f@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master cni0 state UP group default
    link/ether ee:10:6a:39:09:77 brd ff:ff:ff:ff:ff:ff link-netnsid 2
    inet6 fe80::ec10:6aff:fe39:977/64 scope link
       valid_lft forever preferred_lft forever
1323: veth4630388b@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master cni0 state UP group default
    link/ether 2e:b1:80:9f:ff:50 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::2cb1:80ff:fe9f:ff50/64 scope link
       valid_lft forever preferred_lft forever
```
I am adding the worker nodes to the cluster using:
```
$ kubeadm join --config=/etc/kubernetes/kubeadm-join.yaml --node-name $(hostname -f)
$
$ cat /etc/kubernetes/kubeadm-join.yaml
apiVersion: kubeadm.k8s.io/v1beta3
caCertPath: /etc/kubernetes/pki/ca.crt
discovery:
  bootstrapToken:
    apiServerEndpoint: XXXXXXX:6443
    token: 'XXXXX'
    unsafeSkipCAVerification: true
  timeout: 5m0s
  tlsBootstrapToken: 'XXXXXXXXX'
kind: JoinConfiguration
nodeRegistration:
  name: ip-10-69-100-135.us-west-2.compute.internal
  criSocket: /var/run/dockershim.sock
  kubeletExtraArgs:
    cloud-provider: external
    node-ip: 10.69.100.135
$
```
OK, I got the flannel pods working on the failed worker nodes. The issue was that kubelet needs `--cgroup-driver=systemd` and `--resolv-conf=/run/systemd/resolve/resolv.conf` in the /var/lib/kubelet/kubeadm-flags.env file. Earlier, the env file had these entries:
```
KUBELET_KUBEADM_ARGS="--cloud-provider=external --hostname-override=ip-10-69-100-135.us-west-2.compute.internal --network-plugin=cni --node-ip=10.69.100.135 --pod-infra-container-image=k8s.gcr.io/pause:3.5"
```
Once I added those entries, it started working.
AFAIK, from 1.19 onwards these values don't get added to kubeadm-flags.env; instead they go into /var/lib/kubelet/config.yaml:
```
# cat /var/lib/kubelet/config.yaml | egrep -iw 'cgroupdriver|resolv'
cgroupDriver: systemd
resolvConf: /etc/resolv.conf
#
```
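The fix described above can be scripted. The sketch below splices the two missing flags into the quoted `KUBELET_KUBEADM_ARGS` value; it is demonstrated on a temp copy here, not on the live file. On a real node the path is /var/lib/kubelet/kubeadm-flags.env and kubelet needs a `systemctl restart kubelet` afterwards; the flag values are the ones this report found to work.

```python
from pathlib import Path
import tempfile

# The two flags that made flannel start (from this report):
EXTRA = ' --cgroup-driver=systemd --resolv-conf=/run/systemd/resolve/resolv.conf'

def patch_flags(path: Path) -> str:
    """Append EXTRA inside the quoted KUBELET_KUBEADM_ARGS value if missing."""
    text = path.read_text().rstrip()
    if '--cgroup-driver' not in text and text.endswith('"'):
        # splice the flags in just before the closing quote of the value
        text = text[:-1] + EXTRA + '"'
    path.write_text(text + '\n')
    return path.read_text()

# Demo against a shortened copy of the entry shown in this report:
demo = Path(tempfile.mkdtemp()) / 'kubeadm-flags.env'
demo.write_text('KUBELET_KUBEADM_ARGS="--cloud-provider=external --network-plugin=cni"\n')
patched = patch_flags(demo)
print(patched)
```

The `--cgroup-driver` guard makes the edit idempotent, so re-running it on an already patched file changes nothing.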
As mentioned, I am using flannel 0.16.1 taken directly from https://github.com/flannel-io/flannel/blob/v0.16.1/Documentation/kube-flannel.yml, but with the aws-vpc backend since our worker nodes run in AWS:
net-conf.json:

```
{
  "Network": "10.2.0.0/16",
  "Backend": {
    "Type": "aws-vpc"
  }
}
```
A couple of questions:

1. Is this a bug in flannel, in that it somehow picks up the cgroup and resolv-conf values from kubeadm-flags.env instead of /var/lib/kubelet/config.yaml?
2. If the answer to 1 is no, what should I do differently on the Kubernetes 1.22 side to make flannel work?