Document Common Production Issues with Kubernetes (Kops)
what
- `kube-dns` intermittent outages
- `kube2iam` rate limiting
- `kiam`
- Growing or shrinking masters is complicated: it risks destabilizing quorum, which in turn threatens calico network stability
- kops doesn't currently support rotating certificates
- Upgrading clusters is the most sensitive time. This is mitigated by always testing upgrades first on staging clusters. For example, some upgrades have left some of the kube-system services, like kube-dns or calico, in an unstable state, and manual actions were necessary to remediate. Check the "required actions" in the release notes of every release.
- Historically, `dockerd` has had issues with deadlocking. We haven't seen this lately, but in the past we'd have a reaper on the `dockerd` daemon; today `kops` ships with this out-of-the-box.
- Let's Encrypt rate limiting on staging environments
- Running out of storage on root volume due to docker images not getting reaped
- Random TCP timeouts. Could be related to this `kube-dns` issue: https://github.com/kubernetes/kops/issues/5393
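
Intermittent `kube-dns` outages and random TCP timeouts are hard to catch by hand, so a small probe that resolves a name repeatedly can help confirm whether DNS is the culprit. The following is a minimal sketch, not something from our tooling: the function name `probe_dns`, its parameters, and the one-second "slow" threshold are all assumptions chosen for illustration.

```python
import socket
import time
from typing import Callable, Dict, List


def probe_dns(
    hostname: str,
    attempts: int = 20,
    slow_threshold_s: float = 1.0,   # assumed cutoff for a "slow" lookup
    pause_s: float = 0.5,
    resolver: Callable[[str], object] = socket.gethostbyname,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, object]:
    """Resolve `hostname` repeatedly; count failed and slow lookups."""
    failures = 0
    slow = 0
    latencies: List[float] = []
    for i in range(attempts):
        start = time.monotonic()
        try:
            resolver(hostname)
        except OSError:
            # socket.gaierror and socket.timeout are both OSError subclasses.
            failures += 1
        else:
            elapsed = time.monotonic() - start
            latencies.append(elapsed)
            if elapsed > slow_threshold_s:
                slow += 1
        if i < attempts - 1:
            sleep(pause_s)
    return {
        "attempts": attempts,
        "failures": failures,
        "slow": slow,
        "max_latency_s": max(latencies) if latencies else None,
    }
```

Run from inside a pod, something like `probe_dns("kubernetes.default.svc.cluster.local")` would exercise the cluster resolver; that hostname is only an example and should be replaced with a name that exists in your cluster. A non-zero `failures` or `slow` count during a window with TCP timeouts points toward DNS rather than the application.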
other incidents
- Kubernetes master lost connectivity to some workers. When the worker reconnected, it evicted all of its pods (as it should). However, `kube-proxy` did not update the `iptables`. As a result it kept trying to route traffic to the evicted local pods. This was pre-kops.
- Regional outages at AWS: https://aws.amazon.com/message/41926/, https://aws.amazon.com/message/2329B7/
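
For the stale `iptables` incident above, one way to spot the problem is to compare the DNAT targets in a node's NAT table against the endpoint IPs that are actually live. The sketch below is illustrative only. It assumes `kube-proxy` runs in iptables mode and writes per-endpoint rules in `KUBE-SEP-*` chains with an IPv4 `--to-destination ip:port` target, which should be verified against real `iptables-save -t nat` output on your nodes. The function name `stale_endpoints` and the sample rules are hypothetical.

```python
import re
from typing import Dict, Iterable

# Assumed shape of a kube-proxy (iptables mode) endpoint rule, e.g.:
# -A KUBE-SEP-ABC123 -p tcp -m tcp -j DNAT --to-destination 10.2.1.5:8080
_SEP_RULE = re.compile(
    r"^-A (KUBE-SEP-\S+) .*--to-destination (\d+\.\d+\.\d+\.\d+):\d+"
)


def stale_endpoints(
    iptables_save: str, live_endpoint_ips: Iterable[str]
) -> Dict[str, str]:
    """Map each KUBE-SEP chain to its DNAT IP when that IP is not live.

    `iptables_save` is the text output of `iptables-save -t nat`.
    `live_endpoint_ips` should hold every IP currently listed in the
    cluster's Endpoints objects (pods and host-network endpoints alike).
    """
    live = set(live_endpoint_ips)
    stale: Dict[str, str] = {}
    for line in iptables_save.splitlines():
        match = _SEP_RULE.match(line.strip())
        if match and match.group(2) not in live:
            stale[match.group(1)] = match.group(2)
    return stale
```

Any chain returned here still forwards traffic to an IP that no longer backs a service, which matches the symptom described in the incident. IPv6 destinations are not handled by this sketch.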