terraform-kubernetes-installer

Pods fail to start with network failure after etcd connectivity issue causes flannel/cni subnet lease to expire

Open jlamillan opened this issue 6 years ago • 1 comments

An extended etcd connectivity issue can lead to pods failing to start.

How: flannel uses expiring etcd keys (24-hour TTL) to manage the subnets allocated to worker nodes.

e.g. subnet allocated to k8s-worker-ad1-0:

etcdctl ls /flannel/network/subnets
/flannel/network/subnets/10.99.82.0-24
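
The key name encodes the subnet with a `-` in place of the `/`. A minimal sketch, assuming the key layout shown above, for turning a lease key into CIDR notation. The function name `lease_key_to_cidr` is hypothetical and is not part of flannel or etcdctl:

```shell
#!/bin/sh
# Hypothetical helper: convert a flannel lease key such as
# /flannel/network/subnets/10.99.82.0-24 into CIDR form (10.99.82.0/24).
lease_key_to_cidr() {
    key=${1##*/}                               # drop everything up to the last "/"
    printf '%s/%s\n' "${key%-*}" "${key##*-}"  # address part, then prefix length
}
```

For example, `lease_key_to_cidr /flannel/network/subnets/10.99.82.0-24` prints `10.99.82.0/24`. With the etcd v2 `etcdctl`, `etcdctl -o extended get <key>` should also report the remaining TTL on the lease key.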

When the worker nodes lose connectivity to etcd (e.g. when the etcd-lb is malfunctioning), the TTL on the /flannel/network/subnets/10.99.82.0-24 key expires and the key is removed:

etcdctl ls /flannel/network/subnets

When the connectivity to etcd is restored, a new key is created and distributed to the flannel service on each worker node:

etcdctl ls /flannel/network/subnets
/flannel/network/subnets/10.99.43.0-24

At this point, you'll see a number of symptoms, including new pods failing to start with a complaint like the following (presumably because of the pods still on the old network):

Failed to setup network for pod \"hello-2093073260-lk3f2_default(9f0a4b8a-90f0-11e6-b54b-080027242396)\" using network plugins \"cni\": \"cni0\" already has an IP address different from 10.99.43.1/24,
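
The error suggests the `cni0` bridge still carries the gateway address of the expired subnet. A minimal sketch for checking this on a worker, assuming flannel writes its current lease to `/run/flannel/subnet.env` (flannel's default location; this installer may configure a different path) and that the bridge is named `cni0` as in the log above. The function name `cni_bridge_matches` is hypothetical:

```shell
#!/bin/sh
# Hypothetical check: does the bridge address match flannel's current lease?
# $1: path to flannel's subnet.env (assumed to contain FLANNEL_SUBNET=a.b.c.d/nn)
# $2: address currently on the bridge, e.g. obtained with
#     ip -4 -o addr show dev cni0 | awk '{print $4}'
cni_bridge_matches() {
    expected=$(sed -n 's/^FLANNEL_SUBNET=//p' "$1")
    if [ "$expected" = "$2" ]; then
        echo "ok: bridge address $2 matches flannel lease"
    else
        echo "mismatch: bridge has $2, flannel lease expects $expected"
        return 1
    fi
}
```

On a worker this could be called as `cni_bridge_matches /run/flannel/subnet.env "$(ip -4 -o addr show dev cni0 | awk '{print $4}')"`. A mismatch would be consistent with the "already has an IP address different from" error.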

You'll also see the network namespace container for the pod (the "pause" container that starts alongside the pod's other containers) fail with a related error:

Failed to start with docker id 99a811606b51 with error: API error (500): Cannot start container 99a811606b51cdbeddbea14af474f0df432278ac9f73baea5d8ecaf5453f521e: cannot join network of a non running container: c8a4a648e63d5b18dcd8fad6fd2d70f466584f2a749bb4245bb86a6da1ceea55

It may also have something to do with what these files are doing or not doing when a new subnet is allocated to the worker:

./instances/k8smaster/scripts/flannel.service
./instances/k8sworker/scripts/flannel.service
./instances/k8smaster/scripts/cni-bridge.service
./instances/k8smaster/scripts/cni-bridge.sh

jlamillan avatar Sep 09 '17 18:09 jlamillan

Next time, check the flannel logs for lease activity:

journalctl -u flannel.service | grep -i 'lease'
journalctl -u flannel | grep 'renewed'

jlamillan avatar Jan 09 '18 01:01 jlamillan