[Discussion] Orderly shutdown on non-systemd cluster (Talos Linux)
Hello
Over time I have noticed a fair number of issues when restarting my nodes (e.g. for updates): nodes come back online with corrupted XFS filesystems and/or split brains. The documentation for non-systemd OSes (Talos) states:
Talos does not ship with Systemd, so everything Systemd related needs to be removed
but does not offer any replacement for e.g. drbd-shutdown-guard
Assuming this is the root cause of the issues described above, how would you go about implementing orderly shutdown on a non-systemd OS such as Talos?
I am thinking about implementing something along the lines of:
...
spec:
  containers:
  - name: drbd-reactor
    image: drbd-reactor-image
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "make sure you exit cleanly here"]
Does that even make sense?
The issue with all of these approaches at the container level is that you can't really tell whether the container is stopping because the node is shutting down or because your workload is restarting for some other reason (image update, etc.). This is really a question for the Talos Linux devs...
It's really not an OS-level issue IMO. You have a storage system that is supposed to handle graceful shutdowns in the context of a Kubernetes cluster (including non-systemd). IMO the reason why the pod is being shut down does not matter: it should always trigger an orderly shutdown and result in a consistent storage state. Unless I am missing something, this should be a precondition for running in production in a container environment, regardless of the lower-level architecture.
EDIT: Maybe let me rephrase the above: why does the pod restart reason matter? Why not set everything to secondary on pod shutdown and pick a master (leader election) on pod startup? Why should it matter? These things restart once every month at most in production...
that is supposed to handle graceful shutdowns in the context of a Kubernetes cluster (including non-systemd).
For us to handle a graceful shutdown, the system needs to give us a way to detect a graceful shutdown. For systemd systems that is the current drbd-shutdown-guard service. For non-systemd systems, we have no replacement. If you have an idea, we can implement it. For the reasons below, I don't see the proposal as a viable option.
IMO the reason why the pod is being shut down does not matter: it should always trigger an orderly shutdown and result in a consistent storage state.
That would be incredibly disruptive. This "shutdown" involves unmounting and removing all DRBD devices. So every time this pod restarts, you would also have to remove and recreate all Pods using volumes. That does not sound like a good user experience to me.
Why not set everything to secondary on pod shutdown and pick a master (leader election) on pod startup? Why should it matter? These things restart once every month at most in production...
I think you are misunderstanding what this is used for and why we can't just do it on every pod restart. We use DRBD to replicate data; all volumes use DRBD at some layer. DRBD is capable of handling a sudden node outage just fine, that is one of its main use cases. If that does not work for you, that is a bug in DRBD.
What this service should do is add additional information to the cluster: if a node goes down, the service should run and "drbdsetup down ..." all DRBD devices. That will cause DRBD to properly terminate all TCP connections, so all other cluster nodes immediately "know" that the node is unavailable. If you do not run the service, eventually the system will just shut down, leaving the DRBD TCP connections open. The other nodes will notice after a timeout and carry on, but until then they will wait for the node. This is not a big deal, just something that makes node down/failover a bit smoother.
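For illustration, the shutdown step itself boils down to something like the following. This is only a minimal sketch of the idea described above, not the actual drbd-shutdown-guard implementation:

#!/bin/sh
# Minimal sketch: take down all DRBD resources during node shutdown, after
# workloads have released the volumes. This closes the replication TCP
# connections cleanly, so peer nodes notice immediately instead of waiting
# for their connection timeout.
drbdadm down all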
Ok thanks for taking the time to explain.
I do not know why I end up with corrupted XFS filesystems. I just know it happens every now and then, e.g. when a node restarts but doesn't unmount during the grace period (Talos waits 90s and then just force restarts), or when I run a failed update and have to send a force shutdown command to the node BMC (the node then just restarts without any warning). At least I believe these are the two cases where I saw this, and it happened more than once. If this is a DRBD bug, how can I debug it?
I have a very similar experience running Talos and Piraeus. I have not been able to identify why this happens, but for some reason the filesystem gets corrupted and I have to run xfs_repair -L, which is a LAST RESORT command. I run it because the filesystem can't be mounted or repaired any other way. I haven't lost a file yet, but I don't write that much to these volumes.
I'm having this issue as well. I'm encrypting all partitions, and I'm not sure if it's the cause, but I also noticed that the /var partition fails to cryptsetup close with EBUSY:
{
  "component": "controller-runtime",
  "controller": "block.VolumeManagerController",
  "volume": "EPHEMERAL",
  "phase": "failed -> failed",
  "error": "error closing encrypted volume mapped to \"luks2-EPHEMERAL\": error closing luks2-EPHEMERAL: mapped device is still in use",
  "location": "/dev/nvme0n1p5",
  "mountLocation": "/dev/dm-1",
  "parentLocation": "/dev/nvme0n1"
}
@WanzenBug
For non-systemd systems, we have no replacement. If you have an idea, we can implement it.
While a node is being shut down, the Node API object has a status.conditions entry like this one:
status:
  conditions:
  - type: Ready
    status: "False"
    reason: KubeletNotReady
    message: node is shutting down
    lastTransitionTime: "2025-10-22T16:25:57Z"
    lastHeartbeatTime: "2025-10-22T16:26:00Z"
There's more information about this in https://kubernetes.io/docs/concepts/cluster-administration/node-shutdown/
The satellite pod should have a preStop hook that checks the node for this condition, and if the node is shutting down it should detach all drbd devices.
As an example of how to do it with sh and kubectl:
if [ "$(kubectl get node $KUBE_NODE_NAME -o jsonpath='{.status.conditions[?(.type=="Ready")].message}')" \
== "node is shutting down" ]; then
# EXIT_CLEANLY
fi
I don't know what EXIT_CLEANLY should be, but I'd like to have the option to evacuate the LINSTOR node when the Kubernetes node shuts down. It would then be the cluster operator's responsibility to set the kubelet shutdownGracePeriodCriticalPods long enough for all the data to sync to another node.
Keep in mind that any container without preStop will be sent SIGTERM at the same time that the preStop hook runs. It's valid to trap SIGTERM instead of implementing a hook.
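A rough sh sketch of the SIGTERM variant, assuming the node name is injected via the downward API as KUBE_NODE_NAME (illustration only):

#!/bin/sh
# Sketch: handle SIGTERM in the container entrypoint instead of a preStop hook.
on_term() {
  msg="$(kubectl get node "$KUBE_NODE_NAME" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}')"
  if [ "$msg" = "node is shutting down" ]; then
    # Placeholder for the actual "exit cleanly" steps discussed below.
    echo "node is shutting down, detaching DRBD devices"
  fi
  exit 0
}
trap on_term TERM

# Start the real workload and wait on it so the trap can fire.
"$@" &
wait $!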
Kind regards
That seems like a suggestion that could work. I guess exit cleanly would generally be:
- Wait for normal Pods to drain. Guess this is best handled by being a "critical pod".
- Wait for other volumes to be detached. Not sure if we need some additional coordination there to keep the linstor-csi-node Pods running long enough to call umount.
- After some timeout, if a volume is still mounted, call drbdadm secondary --force ....
- After some additional timeout, run drbdadm disconnect ....
The node should then be ready for a clean shutdown.
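A rough sketch of that sequence in sh; the 60s/30s timeouts and the /proc/mounts check are placeholders, not values from any shipped script:

#!/bin/sh
# Illustrative only: wait for volumes to be unmounted, then progressively
# force the DRBD devices down so the node can shut down cleanly.
timeout=60
while [ "$timeout" -gt 0 ] && grep -q '^/dev/drbd' /proc/mounts; do
  # Give the kubelet / CSI node plugin time to unmount everything.
  sleep 1
  timeout=$((timeout - 1))
done

# If something is still mounted after the timeout, force the devices secondary.
drbdadm secondary --force all || true

# After an additional grace period, tear down the replication connections so
# the peers notice immediately.
sleep 30
drbdadm disconnect all || true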
Note that this would not evacuate the LINSTOR Satellite. This is better handled on the Operator side instead of in some shutdown hook. In the next release, we will properly support that when using ClusterAPI. Alternatively, if you set deletionPolicy: Evacuate on the Satellite and then delete it after the Kubelet has initiated the shutdown, the Operator will ensure an orderly evacuation.
- Wait for normal Pods to drain. Guess this is best handled by being a "critical pod".
I agree, and this is already handled: both the satellite and csi-node Pods are declared system-node-critical. The only thing missing is to document the need to set shutdownGracePeriod and shutdownGracePeriodCriticalPods in kubelet.yaml. By default Talos sets these to 30s and 10s, respectively.
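For Talos that would presumably be a machine config patch along these lines; the 90s/60s values are just examples, and whether these kubelet fields are accepted via extraConfig should be checked against the Talos docs:

machine:
  kubelet:
    extraConfig:
      # Total time the kubelet delays node shutdown to terminate pods.
      shutdownGracePeriod: 90s
      # Portion of that reserved for critical pods (satellite, csi-node).
      shutdownGracePeriodCriticalPods: 60s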
- Wait for other volumes to be detached. Not sure if we need some additional coordination there to keep the linstor-csi-node Pods running long enough to call umount.
It turns out the kubelet has special logic for this and won't start terminating critical pods before all the volumes from normal pods have been unmounted. So this too is already handled for us :)
- After some timeout, if a volume is still mounted, call drbdadm secondary --force ....
- After some additional timeout, run drbdadm disconnect ....

The node should then be ready for a clean shutdown.
I made these patches to my cluster and tried it out:
- Add get Node permissions to the satellite ServiceAccount:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: linstor-satellite
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: linstor-satellite
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: linstor-satellite
subjects:
- kind: ServiceAccount
  name: satellite
  namespace: piraeus-datastore
- Set automountServiceAccountToken: true in the linstor-satellite daemonset template
- Add a preStop hook to the daemonset:
apiVersion: piraeus.io/v1
kind: LinstorSatelliteConfiguration
metadata:
  name: graceful-shutdown
spec:
  podTemplate:
    spec:
      automountServiceAccountToken: true
      containers:
      - name: linstor-satellite
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -xc
              - |
                # Redirect stdout to container logs
                exec 1<>/proc/1/fd/1 2<>/proc/1/fd/2
                MESSAGE="$(kubectl get node $LB_FORCE_NODE_NAME -o jsonpath='{.status.conditions[?(.type=="Ready")].message}')"
                if [ "$MESSAGE" = "node is shutting down" ]; then
                  drbdadm secondary all --force
                  drbdadm down all
                  # losetup detach should have been done by the satellite
                  # immediately after drbd attach to prevent leaking loop devices
                  losetup -D
                fi
This did not work on my single-node test cluster because the kube-apiserver Pod was being terminated at the same time, before kubectl could run :(
But it did work on my production cluster! Now I no longer get errors in the console when restarting the node!
In the end I came up with this workaround in the preStop script that also works on single-node clusters:
# Redirect stdout to container logs
exec 1<>/proc/1/fd/1 2<>/proc/1/fd/2
# losetup detach everything so the kernel can release the
# loop devices as soon as they are not needed anymore
losetup -D
drbdadm down all
MESSAGE="$(kubectl get node $LB_FORCE_NODE_NAME -o jsonpath='{.status.conditions[?(.type=="Ready")].message}')"
if [ "$MESSAGE" = "node is shutting down" ]; then
drbdadm secondary all --force
drbdadm down all
fi
This runs drbdadm down all unconditionally.
It handles the happy cases without much disruption: replication to this node stops briefly when the satellite is restarted, but the new satellite reattaches the devices soon after. And since that unconditional call doesn't use the --force flag, pods with mounted volumes are unaffected.
I did not handle the unhappy case where the node is partitioned from the network, DRBD has suspended IO, the apiserver is unreachable, and the node is shut down. I didn't have time to research how to tell whether a node is being shut down when the apiserver is unreachable.
I hope this helps!