Support zero-downtime upgrading for the Trident controller plugin
Describe the solution you'd like

We would like the Trident operator to upgrade the Trident controller plugin without downtime.
Similar to https://github.com/NetApp/trident/issues/740, the Trident operator deletes the Deployment for the Trident controller plugin when updating the Trident version. This makes all Trident functionality unavailable until the new controller pod becomes ready.
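As a rough illustration (the Deployment name trident-csi is inferred from the pod name in the output further below), the delete-and-recreate behavior can be observed by watching the Deployment during an operator upgrade:

$ kubectl get deployment -n trident trident-csi -w

The Deployment briefly disappears and is recreated from scratch rather than being rolled over in place.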
Furthermore, the Deployment for the Trident controller plugin has only one replica, and its update strategy is Recreate. So even if the Trident operator did not delete the Deployment, the Deployment controller would not create a new controller pod while the old pod fails to be deleted, again leaving all of Trident's functionality unavailable.
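Both settings can be checked directly on the cluster; a minimal sketch, again assuming the Deployment is named trident-csi:

$ kubectl get deployment -n trident trident-csi -o jsonpath='{.spec.replicas} {.spec.strategy.type}'
1 Recreate

(The output shown here simply restates the claim above: one replica with the Recreate strategy.)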
Because pods that cannot be deleted (stuck in the Terminating state) are a common problem in Kubernetes, we would like the Trident controller plugin to run with multiple replicas and leader election.
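As a sketch of what this could look like (it does not work today: the Trident controller container itself has no leader-election support, and the container index of the csi-provisioner sidecar below is an assumption), the standard CSI sidecars already accept a --leader-election flag, so one could imagine patching the Deployment roughly like this:

$ kubectl patch deployment -n trident trident-csi --type json -p '[
    {"op": "replace", "path": "/spec/replicas", "value": 2},
    {"op": "add", "path": "/spec/template/spec/containers/1/args/-", "value": "--leader-election=true"}
  ]'

The missing piece this issue asks for is the equivalent election support in the Trident controller itself, so that the extra replicas are actually safe to run.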
Describe alternatives you've considered

None.
Additional context

This situation can be reproduced with the following steps.
- Deploy the Trident operator v22.01.1 with the TridentOrchestrator object.
- Wait until all Trident pods become ready.
- Set a dummy finalizer on the Trident controller pod.
  - e.g.
    kubectl patch -n trident -p '{"metadata":{"finalizers": ["example.com/dummy"]}}' "$(kubectl get pods -n trident -l app=controller.csi.trident.netapp.io -o name | head -1)"
  - This step simulates a controller plugin pod that cannot be deleted.
- Update the Trident operator and the TridentOrchestrator object to v22.04.0 (example commands follow the output below).
- There will be no healthy controller pod, which means none of Trident's functionality works.
$ kubectl get pods -n trident -l app=controller.csi.trident.netapp.io
NAME                          READY   STATUS        RESTARTS   AGE
trident-csi-ccc5cdd56-hkppj   0/6     Terminating   0          6m5s
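For the update step above, one possible set of commands is sketched here; the operator Deployment name trident-operator and the spec.tridentImage field are assumptions based on the v22.x operator layout, so adjust them to your installation:

$ kubectl set image -n trident deployment/trident-operator '*=netapp/trident-operator:22.04.0'
$ kubectl patch tridentorchestrator trident --type merge -p '{"spec":{"tridentImage":"netapp/trident:22.04.0"}}'

To recover the stuck pod after reproducing the issue, the dummy finalizer can be cleared again (a merge patch with null removes the field):

$ kubectl patch -n trident -p '{"metadata":{"finalizers": null}}' "$(kubectl get pods -n trident -l app=controller.csi.trident.netapp.io -o name | head -1)"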
@gnarl This issue involves a Deployment rather than a DaemonSet, but the impact on our users is the same as in the https://github.com/NetApp/trident/issues/740#issuecomment-1210188897 case.
In particular, the Trident controller plugin runs with replicas set to 1, so it is a single point of failure (SPOF). If the controller plugin pod is stuck in Terminating status, all application pods that need its volumes stay Pending during Kubernetes or node upgrades.
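For example (an illustrative command, not from the original comment), the application pods held up this way would show up cluster-wide as Pending:

$ kubectl get pods -A --field-selector=status.phase=Pending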