
Introducing `piraeus.io/evacuate` Taint for Node Draining Support

boedy opened this issue 2 years ago • 8 comments

I'm pleased to share my first contribution, focused on enhancing node draining in LINSTOR/Piraeus clusters. Feedback is welcome on anything that could be further improved or optimised, or on things I might have overlooked.

Note: This PR is not yet complete. While the core functionality is in place, documentation should be added once the initial review is done and any suggested changes are incorporated.

Summary:

This pull request introduces a new annotation, linstor.linbit.com/prepare-for-removal, for Kubernetes nodes. When a node carries this annotation, the LinstorSatelliteReconciler prepares it for draining by auto-placing additional replicas for all of its resources.

The annotation plays a crucial role in scenarios where a resource does not already have replicas on other nodes. By creating additional replicas, it ensures that data remains available even when the original node is offline, and it allows the workload to be migrated in cases where allowRemoteVolumeAccess: false is set. This feature is also particularly useful because Piraeus currently lacks (please correct me if I'm wrong) auto-healing functionality to automatically restore the replica count when a node is lost.

The annotation provides a simple and declarative way to prepare nodes for maintenance or decommissioning while enhancing data availability.
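
For illustration only, here is a minimal sketch of applying the annotation to a node programmatically with client-go; the node name worker-1 and the empty annotation value are placeholders, and a plain kubectl annotate achieves the same thing:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig; in-cluster config would work just as well.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// Merge-patch the node to add the annotation that triggers the draining
	// preparation. "worker-1" and the empty value are placeholders.
	patch := []byte(`{"metadata":{"annotations":{"linstor.linbit.com/prepare-for-removal":""}}}`)
	node, err := clientset.CoreV1().Nodes().Patch(context.TODO(), "worker-1",
		types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("annotated node:", node.Name)
}
```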

Required Components:

For this feature to fully work and allow Kubernetes to reschedule the workload onto a new node, the Persistent Volume (PV) needs to be updated. This is handled by the linstor-affinity-controller, which must be installed in the cluster. The controller updates the PV's node affinity so the Pod can move to a new node, complementing the draining and undo-draining functionality introduced by this pull request.

Changes:

  1. New Annotation:

    • linstor.linbit.com/prepare-for-removal: When added to a node, it triggers the prepareNodeForDraining method. When removed, it triggers the undoNodeDrainingPreparation method to reverse the changes.
  2. prepareNodeForDraining Method:

    • Fetches the list of resources on the annotated node.
    • Autoplaces an additional replica for each resource.
    • Updates node properties to mark it for removal and disable new resource placements.
  3. undoNodeDrainingPreparation Method:

    • Counterpart to prepareNodeForDraining.
    • Deletes either the newest replica or the one on the annotated node, depending on which replica is currently in use.
    • Removes the node properties set during the preparation for draining (see the sketch below).
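
To make the intended control flow concrete, here is a rough sketch of how the reconciler might dispatch on the annotation. The method names prepareNodeForDraining and undoNodeDrainingPreparation are the ones from this PR; the standalone function shape, the callback parameters and the small main are purely illustrative, not the actual reconciler code:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

const prepareForRemovalAnnotation = "linstor.linbit.com/prepare-for-removal"

// reconcileNodeDraining decides per node whether draining preparation has to
// be performed or reverted, based solely on the presence of the annotation.
// prepare and undo stand in for prepareNodeForDraining and
// undoNodeDrainingPreparation from this PR.
func reconcileNodeDraining(ctx context.Context, node *corev1.Node,
	prepare, undo func(ctx context.Context, nodeName string) error) error {
	if _, ok := node.Annotations[prepareForRemovalAnnotation]; ok {
		// Annotation present: place extra replicas and block new placements.
		return prepare(ctx, node.Name)
	}
	// Annotation absent (or removed again): revert any preparation.
	return undo(ctx, node.Name)
}

func main() {
	node := &corev1.Node{}
	node.Name = "worker-1"
	node.Annotations = map[string]string{prepareForRemovalAnnotation: ""}

	log := func(ctx context.Context, name string) error {
		fmt.Println("preparing/undoing draining for", name)
		return nil
	}
	_ = reconcileNodeDraining(context.TODO(), node, log, log)
}
```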

Additional Information:

  • The current implementation requires an extra API call and some additional logic to determine where the newly created replica was placed, because the lc.Resources.Autoplace method does not return that metadata. According to the Swagger documentation for the LINSTOR API, the obj_refs field could contain the required information; exposing it would need an update to the golinstor library. A sketch of the current workaround follows.
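
For context, a rough sketch of that workaround: list the resource's placements before and after the autoplace call and diff them to find the node that received the new replica. This is not the PR code; the golinstor types and calls used here (lapi.Client, Resources.GetAll, AutoPlaceRequest with AutoSelectFilter.PlaceCount) reflect my reading of the client library and may differ between versions, and the sketch ignores diskless/TieBreaker replicas for simplicity:

```go
package evacuation

import (
	"context"
	"fmt"

	lapi "github.com/LINBIT/golinstor/client"
)

// placeExtraReplica asks LINSTOR to auto-place one additional replica of
// resName and returns the node that received it. Because Autoplace does not
// report the chosen node, the placement is listed before and after the call
// and diffed: the extra API call mentioned above.
func placeExtraReplica(ctx context.Context, lc *lapi.Client, resName string) (string, error) {
	before, err := lc.Resources.GetAll(ctx, resName)
	if err != nil {
		return "", err
	}
	existing := map[string]bool{}
	for _, r := range before {
		existing[r.NodeName] = true
	}

	// Request one replica on top of the current placement. This simple sketch
	// counts diskless replicas as well, which the real code must not do.
	err = lc.Resources.Autoplace(ctx, resName, lapi.AutoPlaceRequest{
		SelectFilter: lapi.AutoSelectFilter{PlaceCount: int32(len(before)) + 1},
	})
	if err != nil {
		return "", err
	}

	// Second round trip: find the node that was not part of the old placement.
	after, err := lc.Resources.GetAll(ctx, resName)
	if err != nil {
		return "", err
	}
	for _, r := range after {
		if !existing[r.NodeName] {
			return r.NodeName, nil
		}
	}
	return "", fmt.Errorf("no new replica found for resource %s", resName)
}
```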

boedy · Sep 20 '23

Having thought some more about it: would it make sense to use a custom taint instead of a node annotation? So piraeus.io/node-removal: NoSchedule? That way we would also automatically prevent Kubernetes from trying to create new resources on the node.
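
A check along those lines could look roughly like this; the key piraeus.io/node-removal is just the name suggested above (the PR title later settles on piraeus.io/evacuate), so treat it as a placeholder:

```go
package evacuation

import corev1 "k8s.io/api/core/v1"

// Placeholder for whatever taint key the PR finally settles on.
const evacuateTaintKey = "piraeus.io/node-removal"

// hasEvacuateTaint reports whether the node carries the NoSchedule taint that
// should trigger the evacuation logic in the reconciler.
func hasEvacuateTaint(node *corev1.Node) bool {
	for _, t := range node.Spec.Taints {
		if t.Key == evacuateTaintKey && t.Effect == corev1.TaintEffectNoSchedule {
			return true
		}
	}
	return false
}
```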

Other things I noticed: I think the Aux/PrepareForRemoval property can be removed. With each reconcile attempt, we should check that a replacement resource is ready, by checking for a matching Aux/CopiedOverFromNode resource. We would then also need to wait for the resource to actually fully sync up before reporting success.

Then finally, we need to remove the Aux/CopiedOverFromNode once the original node is gone, so we don't accidentally remove the resource at a later point because of it.

WanzenBug · Oct 11 '23

Hey @WanzenBug,

Thanks for your feedback. I'm on board with the idea of using a taint over an annotation. It does seem to fit the context better, especially considering the eventual node deletion.

Reflecting on it, I think you're spot on about relying exclusively on the Aux/CopiedOverFromNode property. I had initial reservations about potential race conditions, but given the synchronous nature of the reconciliation process, those concerns seem unfounded.

Lastly, taking your suggestions into consideration, I've mapped out a path for the cleanup once everything's in sync and the original node's been decommissioned. I'll try to make some time in the upcoming days and get these changes integrated into the PR.

boedy · Oct 13 '23

Hey @WanzenBug,

I've just pushed an update with my latest changes, taking into account the feedback you've provided. I must admit, there were a few tricky edge cases that really made me scratch my head, particularly around handling nodes that housed resources from another node which had previously been tainted. After some iterations, I'm confident in the solution I've landed on.

To provide a high-level overview of the process in such scenarios:

  1. Consider a three-node cluster with nodes A, B & C
  2. Node A is tainted, starting the evacuation process of node A
  3. All resources on Node A are individually copied to either Node B or Node C
  4. Once all resources from Node A are successfully synchronised to the other nodes, Node A is marked with a 'success' state, indicating it's ready to be drained / deleted
  5. At any point after Node A was tainted, Node B is also tainted
  6. All resources from Node B are copied to Node C. This includes any resources that were originally on Node A and got replicated to Node B in the earlier step.
  7. As soon as a resource (originally from Node A) is fully synchronised to Node C, the resource on Node B is automatically deleted. Additionally, the reference, Aux/EvacuatedFromNode, is updated on Node C to indicate the original source of the resource (Node A in this case).

One thing to note: the current approach does result in a higher number of API calls during each reconciliation cycle, which is something you might want to revisit and optimise down the line.

Let me know what you think.

boedy · Oct 28 '23

I just pushed fixes for some bugs I ran into and had overlooked at first glance.

One thing that isn't yet handled gracefully is the actual deletion of the node / satellite. After I delete the k8s node, the operator spams the following error on every reconcile loop:

```
ERROR Reconciler error {"controller": "linstorsatellite", "controllerGroup": "piraeus.io", "controllerKind": "LinstorSatellite", "LinstorSatellite": {"name":"h-fsn-ded3"}, "namespace": "", "name": "h-fsn-ded3", "reconcileID": "bb2b3d68-4b98-4a24-a715-0b46b1b3fa68", "error": "Message: 'Node: h-fsn-ded3, Resource: pvc-072225f9-b5ca-40d2-9ca2-2adbc2d04d93 preparing for deletion.'; Details: 'Node: h-fsn-ded3, Resource: pvc-072225f9-b5ca-40d2-9ca2-2adbc2d04d93 UUID is: 6a1d14e7-658d-4a89-9d3d-b4810f0d40cb' next error: Message: 'No connection to satellite 'h-fsn-ded3'' next error: Message: 'Preparing deletion of resource on 'h-fsn-ded2'' next error: Message: 'Preparing deletion of resource on 'h-fsn-ded5'' next error: Message: 'Deletion of resource 'pvc-072225f9-b5ca-40d2-9ca2-2adbc2d04d93' on node 'h-fsn-ded3' failed due to an unknown exception.'; Details: 'Node: h-fsn-ded3, Resource: pvc-072225f9-b5ca-40d2-9ca2-2adbc2d04d93'; Reports: '[653FE239-00000-000018]'"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /Users/boedy/projects/go/me/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /Users/boedy/projects/go/me/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /Users/boedy/projects/go/me/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
2023-10-30T18:19:25+01:00 INFO Warning: Reconciler returned both a non-zero result and a non-nil error. The result will always be ignored if the error is non-nil and the non-nil error causes reqeueuing with exponential backoff. For more details, see: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler {"controller": "linstorsatellite", "controllerGroup": "piraeus.io", "controllerKind": "LinstorSatellite", "LinstorSatellite": {"name":"h-fsn-ded3"}, "namespace": "", "name": "h-fsn-ded3", "reconcileID": "201d80fc-f00b-4a8d-bf52-6e77d07f4c20"}
```

When draining the node, the satellite pod also gets evicted. Since the node is tainted, it won't be re-spawned, which in turn prevents the operator from gracefully removing the satellite. How should this be resolved?

Lastly, as AutoplaceTarget is always reset in the undo method, evacuating multiple nodes isn't a pleasant experience.

boedy · Oct 31 '23

I think the satellite needs to tolerate its own taint.
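
If it goes that way, the toleration on the satellite pod spec might look roughly like this; the taint key is assumed from the PR title, and where exactly the operator injects it is left open:

```go
package evacuation

import corev1 "k8s.io/api/core/v1"

// evacuateToleration lets the satellite pod keep running on (or be recreated
// on) a node that carries the evacuation taint, so the operator can still
// reach the satellite while its resources are moved away.
var evacuateToleration = corev1.Toleration{
	Key:      "piraeus.io/evacuate", // assumed taint key, see PR title
	Operator: corev1.TolerationOpExists,
	Effect:   corev1.TaintEffectNoSchedule,
}
```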

WanzenBug · Oct 31 '23

Ok, sorry for the long delay. I played around with this PR a bit more, but I wasn't really happy with the general workflow.

There is some improvement coming on the LINSTOR front, namely LINSTOR will no longer remove the resource on the evacuating node while it is still in use. What that means is that we can simply use Evacuate() instead of having to move volumes around manually. At least once the new feature is released.

I do plan to revisit this PR when that feature is available, and also work on making the behaviour configurable. Some users had trouble with the automatic evacuation when removing a node from k8s, so we should take a look at these processes in general. I envision two new settings (roughly sketched after the list):

  • nodeDrainPolicy, triggered when unschedulable is set on the node.
    • Ignore: just ignore it, satellite stays online
    • NoSchedule: do not place new resources, but keep existing
    • Evacuate: trigger node evacuation (what this PR tries to do)
  • nodeDeletePolicy, triggered when the matching k8s node gets deleted
    • Orphan: just delete the satellite resource, but leave the registered node in LINSTOR alone
    • Evict: try to evacuate/evict the node, placing the resources on other nodes.
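
Purely as a sketch of how those two settings might surface in the operator's Go API types (nothing here is implemented yet; the names simply mirror the proposal above):

```go
package evacuation

// NodeDrainPolicy controls what happens when a node is marked unschedulable.
type NodeDrainPolicy string

const (
	DrainIgnore     NodeDrainPolicy = "Ignore"     // satellite stays online, nothing changes
	DrainNoSchedule NodeDrainPolicy = "NoSchedule" // keep existing resources, place no new ones
	DrainEvacuate   NodeDrainPolicy = "Evacuate"   // trigger node evacuation (what this PR tries to do)
)

// NodeDeletePolicy controls what happens when the matching k8s node is deleted.
type NodeDeletePolicy string

const (
	DeleteOrphan NodeDeletePolicy = "Orphan" // delete the satellite resource, keep the LINSTOR node registered
	DeleteEvict  NodeDeletePolicy = "Evict"  // evacuate/evict the node, placing resources elsewhere
)
```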

WanzenBug · Feb 13 '24

Hey @WanzenBug, Thanks for the update. I think the two settings make sense.

> There is some improvement coming on the LINSTOR front, namely LINSTOR will no longer remove the resource on the evacuating node while it is still in use. What that means is that we can simply use Evacuate() instead of having to move volumes around manually. At least once the new feature is released.

Regarding above: How would volumes be handled which only have 1 replica? Would they still be migrated?

boedy · Apr 01 '24

> Regarding above: How would volumes be handled which only have 1 replica? Would they still be migrated?

Yes. When the evacuation is triggered, such a resource would have 2 replicas:

  • on the node to evacuate
  • on the replacement node

As long as the resource is Primary on the node to evacuate, it will not be deleted.

In the meantime, the affinity controller should have updated the PV and k8s should have evicted the Pod, so the resource can become Secondary, at which point it is removed by LINSTOR and you are again left with just one replica.

WanzenBug · Apr 02 '24