Should rolling controller update be the default strategy?
The Helm chart sets the update strategy for the Traefik controller Deployment to RollingUpdate. I am wondering whether this is a "sane" default when nothing is known about the cluster where it is installed?
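For reference, the setting in question is the standard Deployment `strategy` field; a minimal sketch (labels and image tag are illustrative, not necessarily what the chart renders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: traefik
spec:
  replicas: 2
  strategy:
    type: RollingUpdate   # the chart's default; the alternative is Recreate
  selector:
    matchLabels:
      app: traefik
  template:
    metadata:
      labels:
        app: traefik
    spec:
      containers:
        - name: traefik
          image: traefik:v2.4   # illustrative tag
```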
Isn't it the case that during a rolling update, multiple controllers with different configs could be present at the same time (e.g. one old and one new)? If they sit behind a simple load-balancing Service, a request from a client could be routed first to the controller with the new settings and then back to the one with the old settings, causing weird intermittent effects at the client, which might be worse than the short downtime caused by the Recreate strategy.
It seems to me that Recreate would be safer in this regard, and if a cluster operator explicitly wants the RollingUpdate strategy, they could opt in to it.
Or at least allow both options to be set as template values, with a short comment explaining the implications of each?
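A hedged sketch of what such an opt-in could look like, assuming a hypothetical `updateStrategy` key (the key name and indentation are illustrative, not the chart's actual values):

```yaml
# values.yaml (hypothetical key; a comment here would explain the trade-off)
# RollingUpdate minimizes downtime; Recreate guarantees only one config
# is ever live, at the cost of a short outage on every upgrade.
updateStrategy:
  type: RollingUpdate   # set to Recreate to opt out of rolling upgrades

# templates/deployment.yaml excerpt (Helm templating, illustrative):
#  spec:
#    strategy:
#      {{- toYaml .Values.updateStrategy | nindent 6 }}
```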
IMHO, RollingUpdate should fit all use cases. Do you have a specific scenario in mind?
Sure, applying an update results in sequential restarts and a window of time during which different copies of Traefik run side by side. This is the default behavior for a Kubernetes Deployment, and DaemonSets also apply upgrades by restarting containers one after the other.
This assumes the intermittent effects are indeed related to Traefik updates, and that session stickiness/cookies/... are otherwise configured wherever necessary. It also assumes we're talking about deployments with 2+ replicas: the default values set rollingParams maxUnavailable to 1 / 100%, which results in the old container being terminated when the new one is created, so there is no real case of two pods simultaneously assuming the same role when only 1 replica is involved.
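Spelled out as the corresponding rollingUpdate parameters (values as described in this thread; worth double-checking against the chart version actually deployed):

```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    # With replicas: 1, allowing one unavailable pod means the old pod can be
    # terminated before the new one is ready, so two differently configured
    # pods never serve traffic at the same time.
    maxUnavailable: 1
    maxSurge: 1   # illustrative; permits one extra pod during the rollout
```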
When updating Traefik, it is fair to assume the new containers should behave in a way similar to the previous copy. I'm not convinced that, as a cluster admin, I should care about which copy is serving requests while rolling out such an update. If all goes well, the update is applied relatively quickly. If not (the first container is not created, or its liveness/readiness probes do not allow traffic in), I can troubleshoot while my other, previous copies are still healthy. Of course, generally speaking, when hosting applications in Kubernetes there can be exceptions, depending on the specifics of a given update. Then again, Kubernetes Ingress Controllers implement a relatively basic and generic function. I would not worry about breaking changes unless the Traefik changelog says otherwise.
Whereas re-applying the same Chart version while editing your deployment arguments, or maybe mounting a ConfigMap: I guess that argument could be valid? Though I would still expect the defaults to proceed with upgrades in a way that minimizes impact for clients. With a Recreate strategy, every upgrade implies shutting down everything, then pulling the new image and starting a first container. That goes beyond shutting down potentially active client connections: there would be a short period during which clients would be unable to reach their web service at all. And this is assuming the new containers do start up. If, for any reason, my previous Chart values somehow break with the new image, my container would be stuck in CrashLoopBackOff, leaving me with no running copies left.
Recreate doesn't sound like a sane default, at least for most use cases. So far, the only case for Recreate I could have made was regarding the usage of RWO persistent volumes, as reported in https://github.com/traefik/traefik-helm-chart/issues/355. I did submit a PR: https://github.com/traefik/traefik-helm-chart/pull/356. And retrospectively, I was wrong: the default rollingParams cover that case pretty well.
And while this is not necessarily a good argument, it could be worth mentioning that other Kubernetes Ingress Controllers also use the RollingUpdate strategy:
- nginx ( https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v0.47.0/deploy/static/provider/baremetal/deploy.yaml ) does not set any strategy in its Deployment, so RollingUpdate is applied
- the nginxinc variant does the same: https://github.com/nginxinc/kubernetes-ingress/blob/master/deployments/deployment/nginx-ingress.yaml
- same goes for HAProxy: https://github.com/haproxytech/kubernetes-ingress/blob/master/deploy/haproxy-ingress.yaml#L137
- Ambassador allows setting a different strategy ( https://github.com/datawire/ambassador-chart/blob/master/templates/deployment.yaml#L39 ), though the default remains RollingUpdate: https://github.com/datawire/ambassador-chart/blob/master/values.yaml#L245
When in doubt, as with any other Kubernetes-hosted application, I should test my assumptions in a non-production environment before going live. Though in the little experience I have with Traefik, applying updates is not something I usually worry about, with the exception of the v1/v2 migration, which is beyond the scope of this Chart.
If you did run into some specific issue, let us know about it. In theory, making the strategy configurable would be doable, though I'm not convinced it would ever be preferable.
Thanks for the detailed answer. That was exactly the reason why I opened the ticket: to get the team's thought process behind the update strategy. I don't have any specific issue that triggered this question; it was just something I noticed in the Chart and wanted to ask about. I agree with your elaborations.
The situation I had in mind was re-applying the same Chart version after editing the deployment arguments, combined with having more than 1 replica: e.g. changing frontend or backend TLS settings, where simultaneously deployed old and new configs could lead to weird TLS errors (and similar issues; the scenario is not tied to TLS).
But I agree that setting the strategy to Recreate leaves a risk of the new version not coming up at all, triggering a service interruption, which is probably worse except in some quite specific situations.
So for me this is settled.