HA for trident controller pod
Describe the issue
Our major concern is that during a TKC upgrade all nodes are re-created in a rolling manner, so every pod is evicted and re-created. The rolling node restart does not wait for all pods to be up before re-creating the next node, and the Trident controller pod does not have a pod disruption budget. If the controller pod is stuck waiting to start for a long time (for example due to network bandwidth or wrong registry credentials), all the daemonset pods end up in a broken state, which impacts all application pods hosted in our environment.
Describe the solution you'd like
Requesting multiple replicas for the Trident controller and a pod disruption budget for the Trident controller.
All the daemonset pods end up in a broken state, which impacts all application pods hosted in our environment.
What do you mean by that? Did you notice that during the upgrade, when trident-node PODs are struggling to start, all other PODs with PVCs are crashing?
Hi @kmuruganc, are you talking about a pod disruption budget?
Hi @ptrkmkslv,
Question: What do you mean by that? Did you notice that during the upgrade, when trident-node PODs are struggling to start, all other PODs with PVCs are crashing?
Answer: VMware TKGS re-creates nodes in a rolling manner during the upgrade process and does not wait for all pods to be up before re-creating the next node. The Trident controller image is more than 1 GB to download, so the image pull may get stuck because of limited network bandwidth. If the Trident controller pod is stuck waiting to start for a long time, all the trident-node-linux daemonset pods get stuck at the second stage (1/2 ready). This impacts all pods that have PVCs mounted.
Hence we are checking whether running the Trident controller with multiple replicas and a pod disruption budget would avoid this issue.
Hi @alloydsa,
Question: Are you talking about a pod disruption budget?
Answer:
If the Trident controller has a pod disruption budget, its pods will not be evicted from the existing nodes until the remaining pods are in Ready status.
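To make the ask concrete, here is a minimal sketch (in Go, using the upstream policy/v1 types) of the kind of PodDisruptionBudget we have in mind. The name, namespace, and label selector are placeholders rather than the actual Trident ones, and minAvailable: 1 only becomes meaningful once the controller runs more than one replica:

```go
package main

import (
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"sigs.k8s.io/yaml"
)

func main() {
	// Keep at least one controller replica available during voluntary
	// disruptions such as node drains during a rolling upgrade.
	minAvailable := intstr.FromInt(1)

	pdb := policyv1.PodDisruptionBudget{
		TypeMeta:   metav1.TypeMeta{APIVersion: "policy/v1", Kind: "PodDisruptionBudget"},
		ObjectMeta: metav1.ObjectMeta{Name: "trident-controller", Namespace: "trident"}, // placeholder names
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector: &metav1.LabelSelector{
				// Placeholder label; this would have to match the labels the
				// controller pods actually carry.
				MatchLabels: map[string]string{"app": "trident-controller"},
			},
		},
	}

	out, _ := yaml.Marshal(pdb)
	fmt.Println(string(out))
}
```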
Thanks Kantharaj Murugan
@kmuruganc
If the Trident controller pod is stuck waiting to start for a long time, all the trident-node-linux daemonset pods get stuck at the second stage (1/2 ready). This impacts all pods that have PVCs mounted.
Can you explain in detail what you mean by that? What does "This impacts all pods that have PVCs mounted" mean?
Does it mean that all running PODs lost access to their PVCs because 'trident-controller' was somehow missing?
It appears that the Trident Controller is not fault-tolerant by default if I read this Enhancement request correctly. I will take a look at the controller implementation and report back my findings and any prescriptions my search may yield.
The Trident controller is fault-tolerant by virtue of its running as a Deployment, so K8s will restart it after a node drain or any other disruption. The Deployment also uses a liveness probe to ensure it remains responsive.
For efficiency and to reduce the load on the K8s API server, the controller caches its CRD data internally, making running multiple replicas difficult. And using a PodDisruptionBudget on a single-replica Deployment isn't helpful as that would prevent normal maintenance operations. We could consider setting the system-cluster-critical PriorityClass on the Deployment, much as we already use system-node-critical on the Daemonset. We also plan to add minimum memory/CPU resource values. Both of these should help minimize controller downtime.
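As a rough illustration of those two changes (not the actual Trident manifests), the relevant pod-spec fields could look like the sketch below; the container name, image tag, and resource values are placeholders:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	"sigs.k8s.io/yaml"
)

func main() {
	spec := corev1.PodSpec{
		// Cluster-scoped counterpart of the system-node-critical class used on
		// the Daemonset: the controller is scheduled ahead of lower-priority
		// pods and preempted last.
		PriorityClassName: "system-cluster-critical",
		Containers: []corev1.Container{{
			Name:  "trident-main",          // placeholder container name
			Image: "netapp/trident:latest", // placeholder image tag
			Resources: corev1.ResourceRequirements{
				// Placeholder minimums; real values would come from sizing data.
				Requests: corev1.ResourceList{
					corev1.ResourceCPU:    resource.MustParse("100m"),
					corev1.ResourceMemory: resource.MustParse("128Mi"),
				},
			},
		}},
	}

	out, _ := yaml.Marshal(spec)
	fmt.Println(string(out))
}
```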
For completeness, we could theoretically run multiple controller replicas in an active-passive arrangement with leader election. This would reduce downtime by eliminating the need to pull container images to another node, but it would not be instantaneous due to the need to warm the CRD cache from K8s. This is a lot more work and not currently planned, though still simpler than building cache coherency in a multiple replica active-active arrangement.
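For anyone curious what the active-passive option would involve, here is a rough sketch of Lease-based leader election with client-go. The namespace, lease name, and environment variable are illustrative, and this is not how Trident is implemented today:

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Each replica competes for a Lease; only the current holder runs the
	// controller loop, the others wait as warm-ish standbys.
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{Name: "trident-controller-leader", Namespace: "trident"}, // placeholder names
		Client:    client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{
			Identity: os.Getenv("POD_NAME"), // assumes the pod name is injected via the downward API
		},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// This is where the CRD cache would be warmed before serving
				// CSI controller requests.
				log.Println("became leader; starting controller")
			},
			OnStoppedLeading: func() {
				log.Println("lost leadership; exiting so a standby can take over")
				os.Exit(1)
			},
		},
	})
}
```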
Hi @clintonk, given the importance of the Trident controller (needed for volume mounting and therefore for pod creation), I think it is very important that the Trident controller runs in an HA manner. The way I see it, there are two things that must be done:
- Set the priorityClassName. This would make sure that the Trident controller pod is one of the first pods to get started. I have created an issue for this: https://github.com/NetApp/trident/issues/1014
- Run the controller as a multi-pod deployment. This means that there would be minimal, if not no, downtime. I have created an issue for this as well: https://github.com/NetApp/trident/issues/1015
Configurable CPU and memory requests and limits are also very important. I can see that the memory usage of the controller pod is high in our big clusters, which means that once the limit of 1Gi is reached, the pod gets killed, which has severely negative consequences on the cluster.
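To make that concrete, here is a hypothetical sketch (Go, core/v1 types) of what exposing the requests/limits as operator-level settings could look like; the function name and all the values are made up for illustration, not actual Trident configuration:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// controllerResources builds the controller container's resource block from
// user-supplied values, so a large cluster can raise the memory ceiling
// instead of being OOM-killed at a fixed limit.
func controllerResources(cpuReq, memReq, memLimit string) corev1.ResourceRequirements {
	return corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse(cpuReq),
			corev1.ResourceMemory: resource.MustParse(memReq),
		},
		Limits: corev1.ResourceList{
			corev1.ResourceMemory: resource.MustParse(memLimit),
		},
	}
}

func main() {
	// Example values only; a big cluster might want a limit well above 1Gi.
	fmt.Printf("%+v\n", controllerResources("100m", "512Mi", "2Gi"))
}
```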