kured

Custom drain command?

Open • harrytang opened this issue 1 year ago • 12 comments

Hi there,

I have been using Kured and I find it to be a great tool for automating node reboots in a Kubernetes cluster.

I was wondering whether there are any plans to add support for custom drain commands in Kured. It would be really helpful if we could specify our own drain command for Kured to execute before rebooting a node.

If this is not currently on the roadmap, I would love to know if it's something that the Kured development team would consider adding in the future.

Thank you for your time and for your work on Kured. I look forward to hearing back from you.

Best regards, Harry

harrytang • Feb 28 '23 08:02

Hi Harry, we could be open to that. Could you provide an example of what you'd like to do in addition to (or instead of) the normal k8s "drain node" behavior?

jackfrancis • Feb 28 '23 21:02

Hi,

We are currently using Longhorn storage in our cluster, and while a node is being drained, some Longhorn components must keep running so that the volumes can be properly detached (see https://longhorn.io/docs/1.4.0/volumes-and-nodes/maintenance/#updating-the-node-os-or-container-runtime).

We normally use this drain command:

kubectl drain NODEX --delete-emptydir-data --ignore-daemonsets --pod-selector='app!=csi-attacher,app!=csi-provisioner,longhorn.io/component!=instance-manager'

Hope you find this helpful.

Thanks!
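To see what that `--pod-selector` does, here is a minimal Go sketch (not kubectl's or kured's code) that checks a pod's labels against a comma-separated list of `key!=value` requirements. `matchesNotEqualSelector` and `longhornSelector` are hypothetical names, and only the `!=` operator is handled; real Kubernetes selectors also support `=`, `==`, `in`, `notin`, and existence checks. A pod is selected for eviction only if it satisfies every requirement, and, as in Kubernetes, `key!=value` is also satisfied by a pod that has no `key` label at all.

```go
package main

import (
	"fmt"
	"strings"
)

// longhornSelector is the --pod-selector from the drain command above.
const longhornSelector = "app!=csi-attacher,app!=csi-provisioner,longhorn.io/component!=instance-manager"

// matchesNotEqualSelector reports whether a pod with the given labels satisfies
// every comma-separated "key!=value" requirement (requirements are ANDed).
// Hypothetical helper: only the "!=" operator is supported.
func matchesNotEqualSelector(selector string, labels map[string]string) (bool, error) {
	for _, req := range strings.Split(selector, ",") {
		key, value, ok := strings.Cut(strings.TrimSpace(req), "!=")
		if !ok {
			return false, fmt.Errorf("unsupported requirement %q: only key!=value is handled", req)
		}
		// A pod without the key satisfies "key!=value", as in Kubernetes.
		if got, present := labels[strings.TrimSpace(key)]; present && got == strings.TrimSpace(value) {
			return false, nil
		}
	}
	return true, nil
}
```

Under this selector, pods labelled `app=csi-attacher`, `app=csi-provisioner`, or `longhorn.io/component=instance-manager` are skipped by the drain and keep running, while ordinary workload pods, including unlabelled ones, are still evicted.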

harrytang • Feb 28 '23 22:02

That helps, thanks @harrytang!

I'll think about how we might put something like this together, stay tuned!

jackfrancis • Feb 28 '23 23:02

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

github-actions[bot] • Apr 30 '23 01:04

GitHub: keep open.

harrytang • Apr 30 '23 04:04

Are there any other suggestions on how to deal with this issue? If you have a volume that is not replicated, the node fails to drain because of the pod disruption budget. This happens over time when a volume is rarely used and has only one replica. I see that my predecessors used Ansible to stop dockerd and iscsid when patching and rebooting nodes.

I tried setting forceReboot=true, but it does not seem to help.

EDIT: I did find a Longhorn setting, "Allow Node Drain with the Last Healthy Replica". I'll test it and try to remember to report back here.
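Why a single replica blocks the drain comes down to simple arithmetic. The Go sketch below assumes a budget expressed as an integer `minAvailable` (a `maxUnavailable` budget is computed from the expected pod count instead), and the function names are made up. The eviction API rejects an eviction with HTTP 429 when no disruptions are allowed, so a lone healthy pod guarded by `minAvailable: 1` cannot be evicted and the drain cannot complete.

```go
package main

// disruptionsAllowed mirrors the PodDisruptionBudget status field of the same
// name for a minAvailable budget: how many healthy pods may be evicted now.
// Made-up helper for illustration; it is not the disruption controller's code.
func disruptionsAllowed(currentHealthy, minAvailable int) int {
	if allowed := currentHealthy - minAvailable; allowed > 0 {
		return allowed
	}
	return 0
}

// evictionPermitted reports whether the eviction API would accept one more
// eviction; otherwise it answers 429 Too Many Requests and the drain must retry.
func evictionPermitted(currentHealthy, minAvailable int) bool {
	return disruptionsAllowed(currentHealthy, minAvailable) > 0
}
```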

ddsmith2-eprod • May 16 '23 13:05

I am in the same boat as everyone else regarding Longhorn draining. "Allow Node Drain with the Last Healthy Replica" is not solving this issue for me though.

docbobo • Jun 27 '23 06:06

Is anybody looking at this? I'm trying to find a way to create Alertmanager silences and then remove them when the node comes back up, so I'd like a pre-reboot command and a post-reboot command, just as there are pre- and post-reboot options for labels. I imagine it could tie into the same hooks the labels use, and take an approach similar to the one that lets users specify their own reboot command.

I'd be happy to pull something together for this; I just want to check whether this approach is acceptable to folks.

tylerauerbeck • Aug 11 '23 16:08

Hm, it depends a bit on how you want to implement and use pre- and post-reboot commands. Do you want to call a command on the host (with nsenter, as for the sentinel and reboot commands), or should the command run inside the container? We're currently working on restricting kured's privileges and on finding a way to avoid running commands on the host with nsenter. On the other hand, there are no plans to add commands or binaries to our own Docker image for use within the container.
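To illustrate the distinction ckotzbauer draws: a command meant for the host is typically wrapped with `nsenter` so that it runs in the mount namespace of the host's PID 1, whereas an in-container command runs as-is and needs its binary in the image. The Go sketch below shows only that wrapping; `hostCommand` and `commandFor` are hypothetical names, and the exact `nsenter` path and flags kured uses may differ. Joining PID 1's namespace requires the pod to share the host PID namespace and to be privileged, which is the privilege the maintainers want to reduce.

```go
package main

import "fmt"

// hostCommand prefixes argv with nsenter so it runs in the mount namespace of
// the given host PID (1 is the host's init process). Hypothetical helper; it
// only works if the pod shares the host PID namespace and is privileged.
func hostCommand(pid int, argv ...string) []string {
	prefix := []string{"/usr/bin/nsenter", fmt.Sprintf("-m/proc/%d/ns/mnt", pid), "--"}
	return append(prefix, argv...)
}

// commandFor returns the argv to execute: wrapped for the host, or unchanged
// when the command can run inside the container image itself.
func commandFor(onHost bool, argv ...string) []string {
	if onHost {
		return hostCommand(1, argv...)
	}
	return argv
}
```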

ckotzbauer • Aug 11 '23 17:08

We are looking for the same kind of feature (pre-reboot).

The use case is that we sometimes have to switch over the database leader before rebooting a node. Kured could trigger this action automatically before the reboot.

@ckotzbauer There are various ways to do that without changing the kured container image. For example, each hook could be a pod template: the pod or Job would be executed (without privileges), and the controller would wait for it to terminate successfully.

--pre-boot='{"containers": [{"image": "switchover-pg", "command": ["switch-db", "--node-name=$(NODE_ID)"]}]}'
--pre-boot='{"containers": [{"image": "silence-alerts", "command": ["turn-off-alerts", "--node-name=$(NODE_ID)"]}]}'
--post-boot='{"containers": [{"image": "silence-alerts", "command": ["turn-on-alerts", "--node-name=$(NODE_ID)"]}]}'

I'm sure there are other ways to define and execute these kinds of commands; this is just a quick example.

IMO, the feature would be useful. It could also somewhat reduce the need for you to implement many integrations upstream.

ant31 • Aug 15 '23 11:08

Are there any plans to add this feature to the roadmap?

kingnarmer • Apr 11 '24 14:04

@kingnarmer When there's a good concept and someone who needs this can contribute a PR, it can be implemented at any time.

ckotzbauer • Apr 26 '24 13:04