Support zero-downtime upgrading for the Trident controller plugin

Open tksm opened this issue 3 years ago • 1 comments

Describe the solution you'd like

We would like the Trident operator to upgrade the Trident controller plugin without downtime.

As in https://github.com/NetApp/trident/issues/740, the Trident operator deletes the Deployment for the Trident controller plugin when it updates the Trident version. All Trident functionality is then unavailable until the new controller pod becomes ready.

Furthermore, the Deployment for the Trident controller plugin has only one replica, and its strategy is Recreate. So even if the Trident operator did not delete the Deployment, a failure to delete the old pod would stop the Deployment controller from creating a new controller pod, leaving all Trident functionality unavailable.
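
The replica count and strategy can be checked on a running cluster. The following is a minimal sketch, not an official procedure: the helper name show_controller_strategy is hypothetical, and the Deployment name trident-csi is an assumption inferred from the pod name shown later in this issue, so adjust both arguments for your install.

```shell
# Print the replica count and rollout strategy of the controller Deployment.
# Assumption: the Deployment is named "trident-csi" in namespace "trident"
# (inferred from the pod name in this issue; not confirmed by the source).
show_controller_strategy() {
  ns="${1:-trident}"
  deploy="${2:-trident-csi}"
  kubectl get deployment "$deploy" -n "$ns" \
    -o jsonpath='{.spec.replicas}{" "}{.spec.strategy.type}{"\n"}'
}
```

Calling show_controller_strategy with no arguments queries the assumed defaults. If the description above holds, it would report a single replica with the Recreate strategy.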

Because pods that cannot be deleted (stuck in the Terminating state) are a common problem in Kubernetes, we would like the Trident controller plugin to run multiple replicas with leader election.

Describe alternatives you've considered

None

Additional context

This situation can be reproduced with the following steps.

  1. Deploy the trident operator v22.01.1 with the TridentOrchestrator object.
  2. Wait until all trident pods become ready.
  3. Set a dummy finalizer to the Trident controller pod.
    • e.g. kubectl patch -n trident -p '{"metadata":{"finalizers": ["example.com/dummy"]}}' "$(kubectl get pods -n trident -l app=controller.csi.trident.netapp.io -o name | head -1)"
    • This step simulates a controller plugin pod that cannot be deleted.
  4. Update the trident operator and the TridentOrchestrator object to v22.04.0.
  5. There will be no healthy controller pod, which means no Trident functionality works.
$ kubectl get pods -n trident -l app=controller.csi.trident.netapp.io
NAME                          READY   STATUS        RESTARTS   AGE
trident-csi-ccc5cdd56-hkppj   0/6     Terminating   0          6m5s
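
To clean up after the reproduction, the dummy finalizer from step 3 has to be removed so the stuck pod can finish terminating. The sketch below is one way to do that. The function name remove_dummy_finalizer is hypothetical, and the patch clears every finalizer on the pod, which is only safe here on the assumption that the dummy one is the only finalizer set.

```shell
# Clear the finalizers on the first Trident controller pod so that a pod
# stuck in Terminating can be deleted.
# Assumption: the only finalizer present is the dummy one added in step 3;
# this patch removes ALL finalizers on the pod.
remove_dummy_finalizer() {
  ns="${1:-trident}"
  pod="$(kubectl get pods -n "$ns" \
    -l app=controller.csi.trident.netapp.io -o name | head -1)"
  if [ -z "$pod" ]; then
    echo "no controller pod found in namespace $ns" >&2
    return 1
  fi
  kubectl patch -n "$ns" "$pod" --type=merge \
    -p '{"metadata":{"finalizers":null}}'
}
```

Running remove_dummy_finalizer with no arguments targets the trident namespace used in the steps above. Once the finalizer is gone, the old pod should be removed and a new controller pod can be created.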

tksm avatar Jul 12 '22 06:07 tksm

@gnarl This issue involves a Deployment rather than a DaemonSet, but the impact on our users is the same as in the https://github.com/NetApp/trident/issues/740#issuecomment-1210188897 case.

In particular, the Trident controller plugin has replicas set to 1, so it is a single point of failure (SPOF). If the Trident controller plugin pod is stuck in the Terminating state, all application pods stay Pending when Kubernetes or the nodes are upgraded.

ysakashita avatar Aug 16 '22 06:08 ysakashita