Auto healing use case with Draino
Hi @negz, as I mentioned in another issue, I'm trying to leverage Draino plus the AutoScaler to do auto healing/repairing, just like the case documented in Draino's README.md. But now I have a question based on my testing:
After a node has been cordoned and drained by Draino and removed by the AutoScaler, how do I tell the AutoScaler to create a new node? Would you mind sharing the arguments you run the AutoScaler with? Thanks.
@openstacker As the node is drained, each of its pods is evicted. If the evicted pods are owned by a controller (e.g. a Deployment or a StatefulSet) that controller will notice that there are now fewer available replicas than desired, and create new replicas (pods). The Kubernetes scheduler will then attempt to schedule these new pods. If they can be scheduled successfully then no replacement node is needed. If not, they'll be marked as unschedulable, and the Cluster Autoscaler will attempt to create a new node that will be able to schedule them.
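As an illustrative aside, the commands below are roughly the chain Draino automates for a single broken node; <node-name> is a placeholder, and the drain flags vary slightly between kubectl versions:

# Cordon and drain one node by hand, roughly what Draino does automatically
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Watch the owning controllers recreate the evicted pods; any replacements the
# scheduler cannot place will sit in Pending, which is what prompts the Cluster
# Autoscaler to add a node
kubectl get pods --all-namespaces -o wide --watch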
I don't work for Planet Labs anymore, so I can't provide an example Autoscaler configuration, but perhaps @jacobstr or @gleeco could show you the flags they run with.
@negz Thanks for the quick answer. So IIUC, for the auto-healing scenario the key part is healing; the node count doesn't have to return to what it was before. For example, if the current node count is 5 (N1-N5) and the min node count set in the AutoScaler is 3, and node N3 goes down, it will be cordoned and drained by Draino, then deleted by the AutoScaler after all pods are evicted. But if the AutoScaler thinks the current workload runs fine on 4 nodes, no new node will be created. Otherwise, the AutoScaler will call the cloud provider API to create a new node to balance the workload. Anything I missed? Thanks.
@openstacker That's correct! The NPD + Draino + Autoscaler combination will ensure all your pods have somewhere to run after a node is drained and terminated, but will only ensure you return to the same number of nodes you started with if doing so is necessary.
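As an aside, one way to see which way the Autoscaler decided is its status ConfigMap; the name below assumes the default (cluster-autoscaler-status in kube-system), so adjust for your deployment:

# The autoscaler publishes its view of each node group and recent scale activity here
kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml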
@negz Thank you very much for the clarification. Do you think it's worth adding a note to the README to help others understand this scenario correctly?
@openstacker It sounds like it could be! I think the current mention of the Cluster Autoscaler presumes more familiarity than some folks may have.
No problem. Thanks @negz . BTW, I have a question about where to run Draino. In my testing I have seen a case where all worker nodes were NotReady, and there was no place left to run the Draino pod. So I'm wondering: could I run it as a DaemonSet or a Deployment? Thanks.
@openstacker At Planet we typically ran it as a Deployment, though if all nodes in the cluster were NotReady the Draino pod would still not run. We mitigated that risk by running 'supporting infrastructure' deployments on a dedicated set of nodes that did not run regular workloads and thus were less liable to break.
One issue with running Draino as a DaemonSet is that there's currently no master lease support in Draino, so if multiple Drainos run at the same time they'll all race to cordon and drain any node that breaks.
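As a sketch of the 'dedicated supporting infrastructure nodes' approach (the label and taint keys here are made up; the Draino Deployment would then need a matching nodeSelector and toleration, and should keep a single replica given the lack of leader election):

# Label and taint a small pool of infra nodes so regular workloads stay off them
kubectl label node <infra-node> node-role/infra=true
kubectl taint node <infra-node> node-role/infra=true:NoSchedule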
@negz re "no master lease support", I think it's a good and necessary feature. Is anybody working on that? Mind if I take it? We currently run the kubelet on our master nodes to host some critical pods, like auth, the autoscaler, etc., and I'd like to run Draino there as well. We don't need multiple Drainos on day 1, but it's definitely good to have.
@openstacker Please do! I raised https://github.com/planetlabs/draino/issues/39 to track.
Leader election is a good thing™️ but Draino hasn't suffered for the lack of it yet. There have been some stated desires to make Draino do a few stateful things that might require it to be more deliberate (perhaps e.g. #27) in the future, but for what it's worth, we haven't seen any issues running it in production for the past six months.
With respect to the auto-healing use case, I think the documentation does a good job describing the interaction with the cluster-autoscaler.
The relevant cluster-autoscaler flags, pulled from our deployment manifest. Note that $(CLUSTER_NAME) (and likewise $(MIN_NODES) and $(MAX_NODES)) is an environment variable declared elsewhere in the manifest, referenced via Kubernetes' $(VAR) substitution syntax.
'-v=2',
'--logtostderr',
'--balance-similar-node-groups=true',
'--node-group-auto-discovery=mig:namePrefix=$(CLUSTER_NAME)-,min=$(MIN_NODES),max=$(MAX_NODES)',
'--scale-down-unneeded-time=5m',
'--scale-down-delay-after-failure=1m',
'--scale-down-delay-after-add=1m',
And also as a small pro-tip: when I want to get documentation for various kube components I just do, e.g.,
docker run k8s.gcr.io/cluster-autoscaler:v1.13.1 /cluster-autoscaler --help
For example, a relevant flag when scaling down:
--scale-down-utilization-threshold float Node utilization level, defined as sum of requested resources divided by capacity, below which a node can be considered for scale down (default 0.5)
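If you only care about a subset of the flags, the same trick works with a filter on the end (same image as above; stderr is redirected in case the binary prints its help there):

docker run k8s.gcr.io/cluster-autoscaler:v1.13.1 /cluster-autoscaler --help 2>&1 | grep scale-down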
Note that it's not unexpected for the cluster-autoscaler to decline to replace a wedged node after it's removed; it's entirely possible that the drained pods will simply be scheduled elsewhere in the cluster. Generally, if scale-up is required (a few commands for checking each step are sketched after this list):
- You'll first see Pending pods.
- If you kubectl describe these pods, they should indicate that they are pending because of insufficient resources.
- You'll see cluster-autoscaler logs to the effect of "I see some pending pods that need more resources, I noticed that if I scaled <gcp/aws instance-group>, that it would fit so imma go and do that."
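A few commands for checking each of those steps; the deployment name assumes the autoscaler runs as cluster-autoscaler in kube-system, so adjust as needed:

# 1. Find pods stuck in Pending
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# 2. The Events section should mention insufficient cpu/memory
kubectl describe pod <pending-pod>

# 3. Look for the autoscaler's scale-up decision in its logs
kubectl -n kube-system logs deployment/cluster-autoscaler | grep -i scale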