system-upgrade-controller

fix: pod replacement policy should be failed

buroa opened this issue 1 year ago

This is causing jobs to fail again and again if the pod is terminating (node is draining, etc.). This change ensures that a replacement pod is only created if the pod actually failed.

https://kubernetes.io/blog/2023/08/21/kubernetes-1-28-jobapi-update/
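For reference, a minimal sketch of what this would mean on the Job the controller generates (names and image are illustrative; the field itself is from the 1.28 Job API):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: apply-talos-on-node-1        # illustrative name
  namespace: system-upgrade
spec:
  backoffLimit: 6
  # Added in Kubernetes 1.28 behind the JobPodReplacementPolicy feature gate.
  # "Failed" only creates a replacement pod once the previous pod reaches a
  # terminal Failed phase; the default "TerminatingOrFailed" replaces pods as
  # soon as they start terminating, e.g. when evicted by a node drain.
  podReplacementPolicy: Failed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: upgrade
          image: ghcr.io/siderolabs/talosctl:v1.7.6   # illustrative
```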

buroa avatar Aug 27 '24 19:08 buroa

Hey @brandond, I currently maintain a fork of this change that a lot of the community uses. The current one does not work for people running K8s > 1.28. Let me know if you need anything else here.

buroa avatar Aug 27 '24 19:08 buroa

This is the first I've heard of a problem with this, and we deploy SUC for upgrades as part of Rancher. We also test it regularly for automated upgrades of standalone RKE2 and K3s clusters.

Can you point me at any issues reporting this problem and describing how to reproduce it?

brandond avatar Aug 27 '24 21:08 brandond

Can you point me at any issues reporting this problem and describing how to reproduce it?

I can have the community weigh in here as I'll put this thread inside our Discord. We are using Talos and wanted to have rolling upgrades much like k3s.

Here is our job spec; talosctl takes care of the drain.
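(The actual spec isn't reproduced in this thread; purely for illustration, a Plan of this general shape - the image, args, and credential wiring below are assumptions, not the exact manifest - looks something like:)

```yaml
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: talos
  namespace: system-upgrade
spec:
  version: v1.7.6                       # illustrative target Talos version
  serviceAccountName: system-upgrade
  concurrency: 1
  nodeSelector:
    matchExpressions:
      - key: kubernetes.io/os
        operator: In
        values: ["linux"]
  upgrade:
    image: ghcr.io/siderolabs/talosctl  # illustrative upgrade image
    args:
      # talosctl drains the node, stages the new install image, and reboots.
      # Flag usage is illustrative; talosconfig/secret wiring is omitted.
      - upgrade
      - --nodes=$(SYSTEM_UPGRADE_NODE_NAME)
      - --image=ghcr.io/siderolabs/installer:$(SYSTEM_UPGRADE_PLAN_LATEST_VERSION)
```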

Essentially, what talosctl does here is drain the workloads, apply the new kernel, and reboot. Once the drain starts, the Job controller sees the pod status as terminating and recreates the pod. So, with the default behavior described in the Kubernetes 1.28 post above, the upgrade pod gets recreated again.

So the pod keeps getting recreated, which triggers a reboot again and again. The job never reaches success, and system-upgrade-controller can't continue on to another node.

You can reproduce this by deploying a Talos cluster and using the system-upgrade-controller to do an upgrade.

@onedr0p @bjw-s @jfroy @JJGadgets

buroa avatar Aug 28 '24 10:08 buroa

Once this takes place, the job will see the pod status as terminating and recreate the pod. So per K8s 1.28, this will cause the job to be recreated again.

This sounds like desired behavior. The job image is supposed to be idempotent, and exit successfully if the upgrade either completed successfully or has already been applied. There is no guarantee that the job will only be run once; it may be re-run several times before exiting successfully. You'll see this with the OS upgrade examples, where the nodes are rebooted if necessary. The reboot causes the pod to terminate early and run again afterwards - but on the second pass the node no longer needs a reboot, and the job exits cleanly. https://github.com/rancher/system-upgrade-controller/blob/master/examples/openSUSE/microos.yaml
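(Schematically - this is a generic sketch of that pattern with placeholder commands, not the actual example script - the upgrade step amounts to:)

```yaml
# Fragment of a Plan's spec.upgrade section; commands and paths are placeholders.
upgrade:
  image: example.org/os-upgrade:latest          # illustrative
  command: ["sh", "-c"]
  args:
    - |
      # If the node is already on the plan's target version, a previous pass
      # applied the upgrade before rebooting: exit 0 so the job completes.
      if [ "$(installed-os-version)" = "$SYSTEM_UPGRADE_PLAN_LATEST_VERSION" ]; then
        exit 0
      fi
      # Otherwise apply the upgrade and reboot. The pod terminates early and is
      # re-run once the node is back up; that second pass takes the branch above.
      apply-os-upgrade "$SYSTEM_UPGRADE_PLAN_LATEST_VERSION"
      reboot
```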

Why is your job image exiting with a non-zero exit code following a successful upgrade, or if the upgrade is no longer necessary?

brandond avatar Aug 28 '24 17:08 brandond

Why is your job image exiting with a non-zero exit code following a successful upgrade, or if the upgrade is no longer necessary?

I'm not sure if you are following. The command does exit 0, but talosctl is draining the node, so the pod gets evicted during the drain. Since the pod status was terminating, the Job controller recreates it.

This isn't a hard change. If you are opposed to it, I will maintain my fork.

buroa avatar Aug 28 '24 18:08 buroa

I'm not sure if you are following. The command does exit 0.

That's not what you said; you indicated that the job was failing: "This is causing jobs to fail again and again." What you're describing is not failure, it's just that the job pod is getting recreated - which you seem to be rounding up to a failure state.

talosctl is draining the node so the pod gets evicted during the drain. Since the status was terminating, the job spec will recreate it. So the job will be recreated, which will cause a reboot again and again.

As I said, this is a problem with your job image. It should NOT drain the node if it does not need to upgrade the kernel and reboot.

This isn't a hard change. If you are opposed to it, I will maintain my fork.

I don't understand why you would want to maintain a fork instead of just making your job pod observe the correct idempotent behavior.

brandond avatar Aug 28 '24 18:08 brandond

It seems like you don't want this change and I'm certainly not going to "sell" you the idea when you are clearly against it in the first place.

buroa avatar Aug 28 '24 18:08 buroa

I'm not opposed to it per se; I'm just trying to understand whether there's something this fixes other than improving handling of jobs that are not idempotent. This is a basic requirement that is literally called out in the readme:

https://github.com/rancher/system-upgrade-controller/blob/master/README.md#considerations

Additionally, one should take care when defining upgrades by ensuring that such are idempotent--there be dragons.

Is the whole point of your fork just to no longer require idempotent jobs? What else are you modifying to accommodate that goal?

brandond avatar Aug 28 '24 18:08 brandond

I'm a little less technical on this subject, but going to leave my comment in support.

A number of us (about 62 according to kubesearch) are relying on buroa's fork, as the Rancher upstream doesn't work for Talos-based environments (I would very much like to rely on upstream instead of a fork). Essentially, with the talosctl image, the plan's pod is still running while Talos drains the node and performs the upgrade. If we used Rancher upstream, the upgrade pod would keep being recreated until the job reaches its backoffLimit. This just changes the spec to wait until the pod has actually failed, rather than replacing it while it is terminating because it has a deletionTimestamp.

His fork adds a lot of automation for updating and deploying the fork itself; other than that, the changes he proposes here are the only ones needed to bring support to us.

Maybe the right value to default to here is TerminatingOrFailed (the Kubernetes default) instead of just Failed (what Talos needs), per the Job spec: https://github.com/kubernetes/website/blob/dbfbc78c7eb4970fc003a932fcc79eb4a9f7155e/content/en/docs/reference/kubernetes-api/workload-resources/job-v1.md?plain=1#L246

kashalls avatar Aug 28 '24 22:08 kashalls

Can someone point me at the fork in question? I am curious if the talos upgrade job would work as-is without forking if the talosctl upgrade image followed best practices around idempotency instead of unconditionally draining and rebooting the node.
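(Roughly, I mean something like the sketch below - flags and output parsing are illustrative, not a tested recipe - where the upgrade RPC is only sent if the node isn't already on the target version, so re-runs after the reboot exit cleanly:)

```yaml
# Fragment of a Plan's spec.upgrade section; talosconfig wiring is omitted.
upgrade:
  image: ghcr.io/siderolabs/talosctl    # illustrative
  command: ["sh", "-c"]
  args:
    - |
      # Hypothetical guard: skip the unconditional upgrade sequence if the
      # node already reports the target Talos version (rough string match).
      if talosctl --nodes "$SYSTEM_UPGRADE_NODE_NAME" version | grep -q "Tag: $SYSTEM_UPGRADE_PLAN_LATEST_VERSION"; then
        exit 0
      fi
      talosctl --nodes "$SYSTEM_UPGRADE_NODE_NAME" upgrade \
        --image "ghcr.io/siderolabs/installer:$SYSTEM_UPGRADE_PLAN_LATEST_VERSION"
```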

brandond avatar Aug 28 '24 23:08 brandond

Can someone point me at the fork in question? [...]

https://github.com/buroa/system-upgrade-controller/

kashalls avatar Aug 28 '24 23:08 kashalls

talosctl upgrade itself sends an RPC to Talos's machined which then unconditionally runs an upgrade sequence. So Talos does violate the assumptions or requirements SUC places on upgrade plans. It may be possible to fix this in Talos and we can take it up with them. It may also be reasonable to consider bending SUC with something like this PR to accommodate systems that don't track previously-applied upgrade requests/sequences/commands.

jfroy avatar Aug 29 '24 05:08 jfroy

@buroa I would be open to this if the default was the current Kubernetes default of TerminatingOrFailed, as proposed by @kashalls

Other than dependency bumps and example jobs, I'm not seeing anything else that you've changed in your fork. Is that accurate?

brandond avatar Aug 29 '24 15:08 brandond

@buroa I would be open to this if the default was the current Kubernetes default of TerminatingOrFailed, as proposed by @kashalls

Other than dependency bumps and example jobs, I'm not seeing anything else that you've changed in your fork. Is that accurate?

I can change that. And that is correct; these are the only real changes required for this to work.

buroa avatar Aug 29 '24 15:08 buroa

@brandond Ready for you

buroa avatar Aug 29 '24 16:08 buroa