Document Common Production Issues with Kubernetes (Kops)

Open osterman opened this issue 7 years ago • 0 comments

what

kube-dns intermittent outages
kube2iam rate limiting
kiam
Growing/shrinking masters is complicated. It risks destabilizing etcd quorum, which in turn threatens calico network stability.
kops doesn't currently support rotating certificates
Upgrading clusters is the most sensitive time. We mitigate this by always testing upgrades on staging clusters first. For example, some upgrades have left kube-system services, like kube-dns or calico, in an unstable state, and manual actions were necessary to remediate. See the "required actions" in the release notes for each version.
Historically, dockerd has had issues with deadlocking. We haven't seen this lately. In the past we ran a reaper on the dockerd daemon; today kops ships with this out of the box.
Let's Encrypt rate limiting on staging environments
Running out of storage on root volume due to docker images not getting reaped
Random TCP timeouts. Could be related to the kube-dns issue: https://github.com/kubernetes/kops/issues/5393
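
On the point above about growing/shrinking masters: a rough sketch of the majority arithmetic that makes it risky. This assumes the quorum in question is a simple majority of voting members (as with etcd); the function names are made up for illustration and are not part of kops.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - quorum_size(members)


if __name__ == "__main__":
    # Print members, quorum size, and failure margin for small clusters.
    for n in range(1, 6):
        print(n, quorum_size(n), tolerated_failures(n))
```

Under that assumption, three masters tolerate one failure, two masters tolerate none, and four masters still tolerate only one. Any intermediate step while resizing can therefore leave the cluster with less margin than it started with.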

The Kubernetes master lost connectivity to some workers. When a worker reconnected, it evicted all of its pods (as it should). However, kube-proxy did not update the iptables rules, so it kept trying to route traffic to the evicted local pods. This was pre-kops.
Regional outages at AWS: https://aws.amazon.com/message/41926/, https://aws.amazon.com/message/2329B7/
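
For the stale kube-proxy routing case above, a minimal sketch of the check we would want: compare the pod IPs that traffic is still being routed to against the pod IPs that are actually running. Both inputs are hypothetical plain lists of IP strings; how they are collected (from the iptables rules and from the API server) is not covered here.

```python
def stale_targets(routed_ips, live_pod_ips):
    """Return IPs that still receive traffic but no longer belong to a running pod."""
    return sorted(set(routed_ips) - set(live_pod_ips))


if __name__ == "__main__":
    # Hypothetical example values, not taken from a real cluster.
    routed = ["10.0.1.5", "10.0.1.6"]
    live = ["10.0.1.6"]
    print(stale_targets(routed, live))
```

A non-empty result would indicate the routing rules are out of sync with the pods on that node.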

Jul 24 '18 18:07 osterman