terraform-kubernetes-installer
Pods fail to start with a network failure after an etcd connectivity issue causes the flannel/CNI subnet lease to expire
An extended etcd connectivity issue can lead to pods failing to start.
How: flannel uses expiring etcd keys (24-hour TTL) to manage the subnets allocated to worker nodes,
e.g. the subnet allocated to k8s-worker-ad1-0:
etcdctl ls /flannel/network/subnets
/flannel/network/subnets/10.99.82.0-24
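To see how close a lease is to expiring, the remaining TTL on the key can be read from etcd. The following is a minimal sketch, assuming the etcd v2 `etcdctl` and its `-o extended` output, which includes a `TTL: <seconds>` line. The function names and the one-hour threshold are hypothetical and only for illustration.

```shell
#!/bin/sh
# Extract the TTL (seconds remaining) from `etcdctl -o extended get <key>` output on stdin.
# Assumption: etcd v2 etcdctl "extended" format, which contains a "TTL: <n>" line.
ttl_from_extended() {
  awk -F': ' '$1 == "TTL" { print $2 }'
}

# Succeeds when the TTL ($1) is below a threshold ($2), both in seconds.
# The threshold is a hypothetical alerting choice, not something flannel defines.
lease_expiring_soon() {
  [ "$1" -lt "$2" ]
}

# Example (run on a node that can reach etcd):
#   ttl=$(etcdctl -o extended get /flannel/network/subnets/10.99.82.0-24 | ttl_from_extended)
#   lease_expiring_soon "$ttl" 3600 && echo "lease expires in ${ttl}s"
```

A TTL that keeps falling without being reset suggests the flannel service on that worker is no longer renewing its lease.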
When the worker nodes lose connectivity to etcd (e.g. when the etcd-lb is malfunctioning), the TTL on the /flannel/network/subnets/10.99.82.0-24
key expires and the key is removed:
etcdctl ls /flannel/network/subnets
When the connectivity to etcd is restored, a new key is created and distributed to the flannel service on each worker node:
etcdctl ls /flannel/network/subnets
/flannel/network/subnets/10.99.43.0-24
At this point, you'll see a number of symptoms, including new pods failing to start with a complaint like the following (presumably because the cni0 bridge still holds its address from the old subnet):
Failed to setup network for pod \"hello-2093073260-lk3f2_default(9f0a4b8a-90f0-11e6-b54b-080027242396)\" using network plugins \"cni\": \"cni0\" already has an IP address different from 10.99.43.1/24,
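The mismatch in that error can be checked by hand. The sketch below derives the bridge address the CNI plugin expects from a flannel subnet key, so it can be compared with whatever `ip -4 -o addr show cni0` reports on the worker. It assumes the bridge takes the `.1` address of the leased subnet, as the error message above suggests, and that the lease's network address ends in `.0`. The function names are hypothetical.

```shell
#!/bin/sh
# Convert a flannel subnet key (e.g. /flannel/network/subnets/10.99.43.0-24)
# into the address expected on cni0 (10.99.43.1/24).
# Assumption: the bridge uses the .1 host address of a subnet whose network address ends in .0.
expected_bridge_addr() {
  key="${1##*/}"        # strip the etcd path      -> 10.99.43.0-24
  net="${key%-*}"       # network address          -> 10.99.43.0
  prefix="${key##*-}"   # prefix length            -> 24
  printf '%s.1/%s\n' "${net%.*}" "$prefix"
}

# Succeeds when the address currently on cni0 ($2) does not match
# the address expected for the flannel subnet key ($1).
bridge_is_stale() {
  [ "$(expected_bridge_addr "$1")" != "$2" ]
}

# Example:
#   bridge_is_stale /flannel/network/subnets/10.99.43.0-24 10.99.82.1/24 \
#     && echo "cni0 still has the address from the old lease"
```

If the check reports a stale bridge, the worker is in the state described above: flannel holds the new lease but cni0 was never reconfigured.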
You'll also see the pod's network namespace container (the "pause" container that starts alongside the pod's other containers) fail with a related error:
Failed to start with docker id 99a811606b51 with error: API error (500): Cannot start container 99a811606b51cdbeddbea14af474f0df432278ac9f73baea5d8ecaf5453f521e: cannot join network of a non running container: c8a4a648e63d5b18dcd8fad6fd2d70f466584f2a749bb4245bb86a6da1ceea55
It may also have something to do with what these files are doing, or not doing, when a new subnet is allocated to the worker:
./instances/k8smaster/scripts/flannel.service
./instances/k8sworker/scripts/flannel.service
./instances/k8smaster/scripts/cni-bridge.service
./instances/k8smaster/scripts/cni-bridge.sh
Next time this happens, check the flannel logs for lease activity:
journalctl -u flannel.service | grep -i 'lease'
journalctl -u flannel | grep 'renewed'