
unable to find ENI that matches IP -- device or resource busy on worker nodes (flannel version 0.16.1)

shomeprasanjit opened this issue on Mar 23, 2022

We upgraded our cluster from Kubernetes 1.18 to 1.22 and, as part of that, upgraded flannel from 0.11.0 to 0.16.1.

After refreshing the worker node instances, the flannel pods on those nodes go into a crash loop.

Expected Behavior

Flannel should run as before, both before and after the worker node instance refresh (AWS), since it is a DaemonSet and the same version (0.16.1) is running fine on the master nodes, which we upgraded but did not refresh. For reference, here is the log from a node where the aws-vpc backend registers successfully:

I0323 20:37:54.198122       1 awsvpc.go:88] Backend configured as: {
    "Type": "aws-vpc"
  }
I0323 20:37:54.198192       1 kube.go:339] Setting NodeNetworkUnavailable
I0323 20:37:54.898119       1 awsvpc.go:322] Found eni-0e5d8f2c847f22757 that has 10.69.101.137 IP address.
I0323 20:37:55.269139       1 awsvpc.go:79] Route table configured: false
I0323 20:37:55.373826       1 awsvpc.go:141] Found route table rtb-063c27bad0d39875c.

Current Behavior

The pods on the refreshed worker nodes fail with Error registering network: unable to find ENI that matches the IP address.

The error is raised in the aws-vpc backend code, which then calls the findENI function in flannel. Full log from one of the failing pods:

jenkins@ip-10-69-130-165:/home/ubuntu$ kubectl logs po/kube-flannel-ds-2dw7t -n kube-system -f
I0323 22:16:26.575521       1 main.go:218] CLI flags config: {etcdEndpoints:http://127.0.0.1:4001,http://127.0.0.1:2379 etcdPrefix:/coreos.com/network etcdKeyfile: etcdCertfile: etcdCAFile: etcdUsername: etcdPassword: help:false version:false autoDetectIPv4:false autoDetectIPv6:false kubeSubnetMgr:true kubeApiUrl: kubeAnnotationPrefix:flannel.alpha.coreos.com kubeConfigFile: iface:[] ifaceRegex:[] ipMasq:true subnetFile:/run/flannel/subnet.env subnetDir: publicIP: publicIPv6: subnetLeaseRenewMargin:60 healthzIP:0.0.0.0 healthzPort:0 charonExecutablePath: charonViciUri: iptablesResyncSeconds:5 iptablesForwardRules:true netConfPath:/etc/kube-flannel/net-conf.json setNodeNetworkUnavailable:true}
W0323 22:16:26.575592       1 client_config.go:608] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0323 22:16:26.774576       1 kube.go:120] Waiting 10m0s for node controller to sync
I0323 22:16:26.774741       1 kube.go:378] Starting kube subnet manager
I0323 22:16:27.774762       1 kube.go:127] Node controller sync successful
I0323 22:16:27.774797       1 main.go:238] Created subnet manager: Kubernetes Subnet Manager - ip-10-69-102-84.us-west-2.compute.internal
I0323 22:16:27.774814       1 main.go:241] Installing signal handlers
I0323 22:16:27.774946       1 main.go:460] Found network config - Backend type: aws-vpc
I0323 22:16:27.774968       1 main.go:652] Determining IP address of default interface
I0323 22:16:27.775263       1 main.go:699] Using interface with name ens5 and address 10.69.102.84
I0323 22:16:27.775279       1 main.go:721] Defaulting external address to interface address (10.69.102.84)
I0323 22:16:27.775284       1 main.go:734] Defaulting external v6 address to interface address (<nil>)
I0323 22:16:27.775310       1 awsvpc.go:88] Backend configured as: {
    "Type": "aws-vpc"
  }
I0323 22:16:27.775381       1 kube.go:339] Setting NodeNetworkUnavailable
E0323 22:16:29.127731       1 main.go:326] Error registering network: unable to find ENI that matches the 10.69.102.84 IP address. RequestError: send request failed
caused by: Post "https://ec2.us-west-2.amazonaws.com/": dial tcp: lookup ec2.us-west-2.amazonaws.com: device or resource busy
I0323 22:16:29.127782       1 main.go:440] Stopping shutdownHandler...
W0323 22:16:29.127857       1 reflector.go:424] github.com/flannel-io/flannel/subnet/kube/kube.go:379: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
jenkins@ip-10-69-130-165:/home/ubuntu$
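For reference, the ENI lookup that fails here is conceptually similar to the following AWS CLI query, which lists the ENIs attached to the instance and picks the one whose private IP matches the IP flannel detected (10.69.102.84 above). This is an illustration of the matching logic only, not flannel's actual code; it assumes the AWS CLI, jq, and IMDSv1-style instance metadata are available on the node.

# illustration only -- not flannel's code
NODE_IP=10.69.102.84
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
aws ec2 describe-instances --region us-west-2 --instance-ids "$INSTANCE_ID" \
  | jq -r --arg ip "$NODE_IP" \
      '.Reservations[].Instances[].NetworkInterfaces[]
       | select(any(.PrivateIpAddresses[]; .PrivateIpAddress == $ip))
       | .NetworkInterfaceId'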

The flannel pods are crashing on those worker nodes:

kube-system   kube-flannel-ds-2dw7t                                                 0/1     CrashLoopBackOff    19 (4m47s ago)   78m     10.69.102.84    ip-10-69-102-84.us-west-2.compute.internal    <none>           <none>
kube-system   kube-flannel-ds-5p78m                                                 0/1     CrashLoopBackOff    27 (3m12s ago)   117m    10.69.100.135   ip-10-69-100-135.us-west-2.compute.internal   <none>           <none>

ConfigMap for flannel:

jenkins@ip-10-69-130-165:/home/ubuntu$ kubectl describe cm/kube-flannel-cfg -n kube-system
Name:         kube-flannel-cfg
Namespace:    kube-system
Labels:       app=flannel
              tier=node
Annotations:  <none>

Data
====
cni-conf.json:
----
{
  "name": "cbr0",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "flannel",
      "delegate": {
        "hairpinMode": true,
        "isDefaultGateway": true
      }
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      }
    }
  ]
}

net-conf.json:
----
{
  "Network": "10.2.0.0/16",
  "Backend": {
    "Type": "aws-vpc"
  }
}


BinaryData
====

Events:  <none>
jenkins@ip-10-69-130-165:/home/ubuntu$

ClusterRole and ClusterRoleBinding:

jenkins@ip-10-69-130-165:/home/ubuntu$ kubectl describe clusterrole.rbac.authorization.k8s.io/flannel
Name:         flannel
Labels:       <none>
Annotations:  <none>
PolicyRule:
  Resources                       Non-Resource URLs  Resource Names              Verbs
  ---------                       -----------------  --------------              -----
  pods                            []                 []                          [get]
  nodes                           []                 []                          [list watch]
  nodes/status                    []                 []                          [patch]
  podsecuritypolicies.extensions  []                 [psp.flannel.unprivileged]  [use]
jenkins@ip-10-69-130-165:/home/ubuntu$ kubectl describe clusterrolebinding.rbac.authorization.k8s.io/flannel
Name:         flannel
Labels:       <none>
Annotations:  <none>
Role:
  Kind:  ClusterRole
  Name:  flannel
Subjects:
  Kind            Name     Namespace
  ----            ----     ---------
  ServiceAccount  flannel  kube-system
jenkins@ip-10-69-130-165:/home/ubuntu$

The IAM policy attached to the worker node role contains:

            "Action": [
                "ec2:Describe*",
                "ec2:AttachVolume",
                "ec2:DetachVolume",
                "ec2:CreateRoute",
                "ec2:DeleteRoute",
                "ec2:ReplaceRoute",
                "ec2:ModifyInstanceAttribute",
                "ec2:ModifyNetworkInterfaceAttribute",
                "ec2:CreateNetworkInterface",
                "ec2:AttachNetworkInterface",
                "ec2:DeleteNetworkInterface",
                "ec2:DetachNetworkInterface",
                "ec2:AssignPrivateIpAddresses",
                "ec2:UnassignPrivateIpAddresses",
                "ec2:CreateSnapshot",
                "ec2:CreateSnapshots",
                "ec2:CreateTags"
            ],
            "Effect": "Allow",
            "Resource": [
                "*"
            ],
            "Sid": "Ec2Restrictions"

I have also run an ec2 describe CLI command from the same node where flannel is crashing, and it succeeds:

root@ip-10-69-100-135:~#  aws ec2 describe-instances --instance-ids i-XXXXXXX | jq '.Reservations[0].Instances[0].NetworkInterfaces'
[
  {
    "Attachment": {
      "AttachTime": "2022-03-23T19:59:04+00:00",
      "AttachmentId": "eni-attach-XXXXX",
      "DeleteOnTermination": true,
      "DeviceIndex": 0,
      "Status": "attached",
      "NetworkCardIndex": 0
    },
    "Description": "",
    "Groups": [
      {
        "GroupName": "psupgrade1_private_sg",
        "GroupId": "XXXXX"
      }
    ],
    "Ipv6Addresses": [],
    "MacAddress": "XXXX",
    "NetworkInterfaceId": "eni-XXXXXXXXX",
    "OwnerId": "478147420456",
    "PrivateDnsName": "ip-10-69-100-135.us-west-2.compute.internal",
    "PrivateIpAddress": "10.69.100.135",
    "PrivateIpAddresses": [
      {
        "Primary": true,
        "PrivateDnsName": "ip-10-69-100-135.us-west-2.compute.internal",
        "PrivateIpAddress": "10.69.100.135"
      }
    ],
    "SourceDestCheck": true,
    "Status": "in-use",
    "SubnetId": "XXXX",
    "VpcId": "XXXXX",
    "InterfaceType": "interface"
  }
]
root@ip-10-69-100-135:~#
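The call above succeeds from a shell on the node, while the pod log shows the failure is really a name resolution error (lookup ec2.us-west-2.amazonaws.com: device or resource busy), so it seems worth comparing the resolver configuration the host uses with the one the kubelet hands to pods. A rough check (sketch only):

# resolver the host itself points at (on Ubuntu 18.04 this is usually the
# systemd-resolved stub, 127.0.0.53)
cat /etc/resolv.conf
# the real upstream servers maintained by systemd-resolved
cat /run/systemd/resolve/resolv.conf
# whether the kubelet was started with an explicit --resolv-conf
ps -ef | grep '[k]ubelet' | grep -o -- '--resolv-conf=[^ ]*'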


## Your Environment
* Flannel version: 0.16.1
* Backend used (e.g. vxlan or udp): aws-vpc
* Etcd version: 3.5.1
* Kubernetes version (if used): 1.22.6
* Operating System and version: Ubuntu 18.04



shomeprasanjit, Mar 23, 2022

One important thing noticed: flannel is unable to create cni0 and the vethXXXX interfaces on the bad nodes (worker nodes), while they exist on the good nodes (master nodes).

BAD nodes (worker nodes)

root@ip-10-69-100-135:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
    link/ether 02:78:37:ec:06:f7 brd ff:ff:ff:ff:ff:ff
    inet 10.69.100.135/24 brd 10.69.100.255 scope global dynamic ens5
       valid_lft 3158sec preferred_lft 3158sec
    inet6 fe80::78:37ff:feec:6f7/64 scope link
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:1d:79:20:df brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:1dff:fe79:20df/64 scope link
       valid_lft forever preferred_lft forever
4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8951 qdisc noqueue state UNKNOWN group default
    link/ether 5e:ea:09:1e:05:38 brd ff:ff:ff:ff:ff:ff
    inet 10.2.3.0/32 brd 10.2.3.0 scope global flannel.1
       valid_lft forever preferred_lft forever
    inet6 fe80::5cea:9ff:fe1e:538/64 scope link
       valid_lft forever preferred_lft forever
root@ip-10-69-100-135:~#

Good nodes (master nodes)

4: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default qlen 1000
    link/ether d6:2e:f1:1c:9c:b3 brd ff:ff:ff:ff:ff:ff
    inet 10.2.214.1/24 brd 10.2.214.255 scope global cni0
       valid_lft forever preferred_lft forever
    inet6 fe80::d42e:f1ff:fe1c:9cb3/64 scope link
       valid_lft forever preferred_lft forever
6: veth7cc74bbc@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master cni0 state UP group default
    link/ether a2:eb:9f:cc:39:8b brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::a0eb:9fff:fecc:398b/64 scope link
       valid_lft forever preferred_lft forever
7: vethbc56949f@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master cni0 state UP group default
    link/ether ee:10:6a:39:09:77 brd ff:ff:ff:ff:ff:ff link-netnsid 2
    inet6 fe80::ec10:6aff:fe39:977/64 scope link
       valid_lft forever preferred_lft forever
1323: veth4630388b@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master cni0 state UP group default
    link/ether 2e:b1:80:9f:ff:50 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::2cb1:80ff:fe9f:ff50/64 scope link
       valid_lft forever preferred_lft forever
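The cni0 bridge is only created once a pod actually gets wired up through the CNI plugin, so a quick thing to check on a bad node is whether flannel ever managed to write its subnet file and whether the CNI config was dropped into place (paths assume the stock kube-flannel manifest):

# expected to be missing or empty on a node where the flannel pod never
# finished registering (paths from the stock kube-flannel manifest)
cat /run/flannel/subnet.env
ls -l /etc/cni/net.d/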

I am adding the worker nodes to the cluster using:

$ kubeadm join --config=/etc/kubernetes/kubeadm-join.yaml --node-name $(hostname -f)

$ cat /etc/kubernetes/kubeadm-join.yaml
apiVersion: kubeadm.k8s.io/v1beta3
caCertPath: /etc/kubernetes/pki/ca.crt
discovery:
  bootstrapToken:
    apiServerEndpoint: XXXXXXX:6443
    token: 'XXXXX'
    unsafeSkipCAVerification: true
  timeout: 5m0s
  tlsBootstrapToken: 'XXXXXXXXX'
kind: JoinConfiguration
nodeRegistration:
  name: ip-10-69-100-135.us-west-2.compute.internal
  criSocket: /var/run/dockershim.sock
  kubeletExtraArgs:
    cloud-provider: external
    node-ip: 10.69.100.135
$

shomeprasanjit, Mar 24, 2022

OK, I got the flannel pods working on the failing worker nodes.

The issue was that the kubelet needs --cgroup-driver=systemd and --resolv-conf=/run/systemd/resolve/resolv.conf set in the /var/lib/kubelet/kubeadm-flags.env file.

Earlier, the env file only had these entries:

KUBELET_KUBEADM_ARGS="--cloud-provider=external --hostname-override=ip-10-69-100-135.us-west-2.compute.internal --network-plugin=cni --node-ip=10.69.100.135 --pod-infra-container-image=k8s.gcr.io/pause:3.5"

Once I added those two flags, flannel started working.
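For comparison, a reconstruction of the working env file with the two flags appended (only the two added flags are confirmed; the rest is the original line quoted above), plus the kubelet restart that presumably picks them up:

# /var/lib/kubelet/kubeadm-flags.env (reconstructed example)
KUBELET_KUBEADM_ARGS="--cloud-provider=external --hostname-override=ip-10-69-100-135.us-west-2.compute.internal --network-plugin=cni --node-ip=10.69.100.135 --pod-infra-container-image=k8s.gcr.io/pause:3.5 --cgroup-driver=systemd --resolv-conf=/run/systemd/resolve/resolv.conf"

# then restart the kubelet so the new flags take effect
systemctl daemon-reload
systemctl restart kubelet

For nodes joined later, the same two flags could presumably go into nodeRegistration.kubeletExtraArgs in the JoinConfiguration shown earlier, rather than editing kubeadm-flags.env after the fact.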

AFAIK, from 1.19 onwards these values are no longer added to kubeadm-flags.env; they go into /var/lib/kubelet/config.yaml instead:

# cat /var/lib/kubelet/config.yaml | egrep -iw 'cgroupdriver|resolv'
cgroupDriver: systemd
resolvConf: /etc/resolv.conf
#
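If the kubelet on 1.22 is really taking these values from config.yaml, the equivalent change would presumably be pointing resolvConf at the systemd-resolved file (cgroupDriver is already systemd there). A sketch, not verified on these nodes:

# hypothetical equivalent edit in /var/lib/kubelet/config.yaml
sed -i 's|^resolvConf: /etc/resolv.conf$|resolvConf: /run/systemd/resolve/resolv.conf|' /var/lib/kubelet/config.yaml
systemctl restart kubelet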

As mentioned, I am using flannel 0.16.1 taken directly from https://github.com/flannel-io/flannel/blob/v0.16.1/Documentation/kube-flannel.yml, but with the aws-vpc backend, since the worker nodes run in AWS:

net-conf.json:
----
{
  "Network": "10.2.0.0/16",
  "Backend": {
    "Type": "aws-vpc"
  }
}

A couple of questions:

  1. Is this a bug in flannel? Does it somehow read the cgroup and resolv-conf values from kubeadm-flags.env instead of from /var/lib/kubelet/config.yaml?

  2. If the answer to point 1 is no, what should I do differently on the Kubernetes 1.22 side to make flannel work?

shomeprasanjit, Mar 24, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot], Jan 25, 2023