Document Common Production Issues with Kubernetes (Kops)

Open osterman opened this issue 7 years ago • 0 comments

what

  • kube-dns intermittent outages
  • kube2iam rate limiting
  • kiam
  • Growing or shrinking the set of masters is complicated: it risks destabilizing quorum, which in turn threatens calico network stability
  • kops doesn't currently support rotating certificates
  • Upgrading clusters is the most sensitive time. We mitigate this by always testing upgrades on staging clusters first. For example, some upgrades have left kube-system services such as kube-dns or calico in an unstable state, and manual actions were necessary to remediate. See the "required actions" in each release's notes.
  • Historically, dockerd has had issues with deadlocking. We haven't seen this lately. In the past we ran a reaper on the dockerd daemon; today kops ships with one out of the box.
  • Let's Encrypt rate limiting on staging environments
  • Running out of storage on root volume due to docker images not getting reaped
  • Random TCP timeouts. These could be related to the kube-dns issue; see https://github.com/kubernetes/kops/issues/5393
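
For the root volume item above, a minimal sketch of a check that flags a node before the disk fills up. This is an illustration only, not something kops ships: the function names and the 80% threshold are assumptions, and it only reports usage; it does not reap images itself.

```python
import shutil

# Hypothetical threshold: the 80% figure is an assumption, not from this issue.
DEFAULT_THRESHOLD = 0.80


def usage_fraction(used: int, total: int) -> float:
    """Return the used share of a volume as a value between 0 and 1."""
    if total <= 0:
        raise ValueError("total must be positive")
    return used / total


def needs_image_cleanup(used: int, total: int,
                        threshold: float = DEFAULT_THRESHOLD) -> bool:
    """True when usage has reached the threshold and images should be reaped."""
    return usage_fraction(used, total) >= threshold


def root_volume_needs_cleanup(path: str = "/",
                              threshold: float = DEFAULT_THRESHOLD) -> bool:
    """Check the volume that holds `path` (the root volume by default)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return needs_image_cleanup(usage.used, usage.total, threshold)


if __name__ == "__main__":
    if root_volume_needs_cleanup():
        print("root volume above threshold; unused docker images may need reaping")
    else:
        print("root volume usage is below threshold")
```

A check like this could run from cron or a DaemonSet and alert, or trigger something like `docker image prune`, once the threshold is crossed. The right threshold depends on image sizes and churn on the node.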

other incidents

references

release notes

osterman avatar Jul 24 '18 18:07 osterman