eksctl icon indicating copy to clipboard operation
eksctl copied to clipboard

Add ready plugin to CoreDNS and use health.lameduck for a safer shutdown

Open wind0r opened this issue 3 years ago • 2 comments

Description

We noticed that sometimes our applications had DNS issues. This was caused by sending traffic to CoreDNS instances that just stopped.

We compared the CoreDNS deployment in EKS with our bare-metal clusters. The result was that CoreDNS in our EKS-clusters build with eksctl was using the health plugin for both probes instead of using the ready plugin and the health plugin with Lameduck enabled. With Lameduck CoreDNS will delay the shutdown so that in-flight requests will not be dropped and kube-proxy has some time to update the routing.

Those settings are the same as suggested within the CoreDNS Deployment Files and the Helm chart.

Also, we updated all our Clusters with this MR, and we didn't notice any issues. eksctl utils update-coredns --cluster $CLUSTER --approve

CoreDNS Deployment Files CoreDNS Helm Chart CoreDNS Health Plugin CoreDNS Ready Plugin

Checklist

  • [X] Added tests that cover your change (if possible)
  • [X] Added/modified documentation as required (such as the README.md, or the userdocs directory)
  • [X] Manually tested
  • [X] Made sure the title of the PR is a good description that can go into the release notes
  • [ ] (Core team) Added labels for change area (e.g. area/nodegroup) and kind (e.g. kind/improvement)

wind0r avatar Sep 13 '22 10:09 wind0r

Added a new commit which also changes the Resources Settings.

CoreDNS Helm Chart sets Resources Requests and Limits to the same Value. I think eksctl should do the same so that CoreDNS Pods use QoS Guaranteed.

        resources:
          limits:
            cpu: 100m
            memory: 128Mi
          requests:
            cpu: 100m
            memory: 128Mi

wind0r avatar Sep 14 '22 08:09 wind0r

As someone who has been bit by DNS issues when coredns pods are terminated, I applaud this effort :clap:

I had a quick once over of your MR, and it mostly looks good from what I can tell.

One part I'm unsure on is the CPU limits. I think they could cause unintended side effects in some cases. This kubespray issue sums up my concerns. I'm not an expert here, so I'm not 100 percent sure this issue is legitimate.

pnmatich avatar Sep 23 '22 17:09 pnmatich

As someone who has been bit by DNS issues when coredns pods are terminated, I applaud this effort 👏

I had a quick once over of your MR, and it mostly looks good from what I can tell.

One part I'm unsure on is the CPU limits. I think they could cause unintended side effects in some cases. This kubespray issue sums up my concerns. I'm not an expert here, so I'm not 100 percent sure this issue is legitimate.

@wind0r thanks for the contribution, as raised in this comment here, do you think the CPU limits could cause unintended side effects in some cases? Have you identified any cases and are you working on adding any further changes here?

Himangini avatar Oct 13 '22 14:10 Himangini

We used this MR now for some weeks in Production without issues. Still, I just now dropped the CPU Change. The Deployment uses priorityClassName system-node-critical, so QoS Guaranteed isn't really needed.

Also, it makes this change less risky for bigger/higher load clusters that didn't add a CoreDNS autoscaler.

@Himangini I think this is ready to merge and I am not working on further change. I am open to feedback and would help to get this merged so I can run upstream eksctl again instead of our fork.

wind0r avatar Oct 17 '22 08:10 wind0r

@Himangini anything I need to do? Is something missing?

wind0r avatar Oct 25 '22 12:10 wind0r

@wind0r you have ticked the modified docs check but I can't see any changes, are you going to add any docs for users so they can upgrade their CoreDNS plugin settings?

Himangini avatar Oct 26 '22 12:10 Himangini

I think the current docs already describe the update process. Users just need to run eksctl utils update-coredns --cluster=<clusterName> I ticked the checkbox because I think no update is need (as required) but if something is missing I am happy to change it.

wind0r avatar Oct 26 '22 13:10 wind0r