Avoid upgrade being killed by failed liveness probes
Pull Request
Description of the change
This runs the upgrade in an init container, before the main container starts. It sets the init container's command arguments to "true", so the container exits immediately after the upgrade, and sets the variable NEXTCLOUD_UPDATE to 1, without which the upgrade step would be skipped because command arguments are present (see the image's entrypoint).
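As a rough sketch of what that amounts to (not the chart's actual template; the container name, image tag, and volume name below are placeholders):

```yaml
# Rough sketch only, not the chart's actual template: the init container
# reuses the Nextcloud image, lets the entrypoint run the upgrade, and
# exits right away because its command is just "true".
initContainers:
  - name: nextcloud-upgrade           # placeholder name
    image: nextcloud:27-apache        # same image as the main container
    command: ["/entrypoint.sh"]
    args: ["true"]                    # exit immediately after the upgrade
    env:
      - name: NEXTCLOUD_UPDATE
        value: "1"                    # run the upgrade despite custom args
    volumeMounts:
      - name: nextcloud-main          # placeholder: the chart's data volume
        mountPath: /var/www/html
```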
Benefits
Avoid the container being stopped during the upgrade because of failed liveness probes, since init containers don't get probed.
Possible drawbacks
Runs an additional container, so startup is a bit slower, I guess.
Applicable issues
- fixes #333
Additional information
This also removes the nextcloud.update value. I don't see how anyone could have been using it, though, since it only does something when you pass different arguments to the image, and there is no value in this chart that would allow you to do that.
Checklist
- [x] DCO has been signed off on the commit.
- [x] Chart version bumped in Chart.yaml according to semver.
- ~~(optional) Variables are documented in the README.md~~
Thanks for submitting a PR! :)
Will this always run an upgrade any time the pod restarts or additional pods are spun up? Is there a way to disable upgrades until a user is ready?
It will update the volume to the version of the image, so it's usually triggered by the user doing helm upgrade (or kubectl set image I guess). If a Pod is restarted without changing version, nothing will happen.
This doesn't solve the situation where you need to run the web installer. Probes should probably be changed not to fail if the web installer needs to be run (otherwise the pod never becomes ready because not installed, and you can't reach the web installer to install)
This doesn't solve the situation where you need to run the web installer. Probes should probably be changed not to fail if the web installer needs to be run (otherwise the pod never becomes ready because not installed, and you can't reach the web installer to install)
Should you even be using the web based upgrader at all when using container images? The image has the updated files. If you use the web-based upgrader but don't update your container image version, you'll break stuff for sure.
You can disable the web-based updater to prevent people from borking their installation:
'upgrade.disable-web' => true,
https://docs.nextcloud.com/server/stable/admin_manual/configuration_server/config_sample_php_parameters.html
Search for the upgrade.disable-web option.
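With this chart, one way to set that option (assuming your chart version exposes nextcloud.configs for extra PHP config snippets; check the chart's values.yaml) would be roughly:

```yaml
# Hedged sketch: assumes the chart supports nextcloud.configs for mounting
# additional PHP config files into config/ (verify against your chart version).
nextcloud:
  configs:
    upgrade-disable-web.config.php: |-
      <?php
      $CONFIG = array(
        'upgrade.disable-web' => true,
      );
```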
Yeah I also think it should auto-install but that is not the default behavior of running helm install
Hey, what would be needed to get this merged? Also, is it possible to backport this change to the last 3 versions, so 25, 26 and 27?
@4censord I haven't looked at this PR closely, but the conflicts need to be solved and the changes need to be reviewed. With this chart there is no backporting; you just override nextcloud.tag with your desired version.
I'll take a look at resolving the conflicts.
With this chart there is no backporting; you just override nextcloud.tag with your desired version.
Ok
The conflicts are solvable by simply rebasing onto the main branch. @remram44, can you rebase, please? Or, if you'd be fine with that, I could open a new PR with your changes and we close this one.
If I rebase, will somebody review? I don't like to put in work for it to be ignored
@jessebot can you please take a look at this? This is a longstanding issue with this chart, and currently the only way to prevent the upgrade from failing is to disable all probes, which is not really a solution.
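For context, "disable all probes" means roughly this kind of values.yaml override (assuming the per-probe enabled toggles the chart exposes; check your chart version's values.yaml):

```yaml
# Workaround sketch, not a recommendation: assumes the chart exposes
# an "enabled" flag for each probe.
livenessProbe:
  enabled: false
readinessProbe:
  enabled: false
startupProbe:
  enabled: false
```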
I'm using fluxcd gitops tool to manage my cluster, and when the helm upgrade fails because it's taking longer than the probes, fluxcd will rollback the pod to the last working state. But nextcloud doesn't want to start the last working state too because of this error:
Can't start Nextcloud because the version of the data (27.1.3.2) is higher than the docker image version (27.1.2.1) and downgrading is not supported.
Are you sure you have pulled the newest image version?
And I will be left with the pod crashlooping forever until I manually fix it using occ commands.
And I will be left with the pod crashlooping forever until I manually fix it using occ commands.
How exactly are you fixing this? I am left on 25.0.1 currently, and have yet to actually complete an upgrade with the helm deployment.
@4censord I use helm rollback nextcloud <revision> to get it back to the helm revision where the upgrade failed and get rid of the nextcloud error. Then exec into the pod and do php occ upgrade then php occ maintenance:mode --off
@budimanjojo Then I seem to have a different issue; that does not work for me. I have to restore from a backup after attempting an upgrade.
@provokateurin Would you mind taking a look at this once it's convenient? IMO it's ready.
Sorry for the delay. Slowly making my rounds 🙏
@remram44 I did a quick look over and it seems ok, but don't we still want to be able to set upgrade in the values.yaml?
Also, there still needs to be a rebase, as there's conflicts.
Also, tagged @provokateurin to get an additional review to ask about keeping the upgrade parameter in the values.yaml.
I'm not 100% sure if I read https://github.com/nextcloud/docker/blob/master/README.md correctly, but it sounds like the "normal" container will still try to do the upgrade as the default command is used there. Then we have a race condition between the two containers and depending on which one gets to work first the upgrade is killed by the probes or not.
IIRC the upgrade env var is used for triggering the Nextcloud image's upgrade mechanism. But because this PR completely supersedes the internal upgrade-on-startup logic by running the upgrade explicitly beforehand, the upgrade env var does not do anything anymore.
Also, there still needs to be a rebase, as there's conflicts.
Yeah, back in 2023 when we said it was ready, there weren't any.
Then we have a race condition between the two containers and depending on which one gets to work first the upgrade is killed by the probes or not.
No, the init container runs first, and the normal container only gets started once the upgrade completes. If it had the upgrade var set, it just would not do anything because it's already on the latest version.
You're right, I overlooked that it is an initContainer and not another sidecar. Then this makes sense to me, I'll have to give it some testing though to confirm it works as intended.
If you could do a rebase onto the latest main state then I'll give it a test.
We will have to wait for @remram44, it's their PR. I just have an interest in getting this merged because I run into this problem on every Nextcloud upgrade.
Ah yeah I didn't look who submitted the PR and thought it was you :see_no_evil:
I rebased.
I am not sure how this interacts with #466/#525, I moved running the hook to the initContainer only. @wrenix should probably weigh in.
In my opinion, regarding the hooks, it is a better solution to move them to the initContainer (that should work).
I'd say we should have multiple init containers that run in order. That would keep each of them simpler and would help with debugging. I'd have them in this order:
- wait for DB to be up
- Upgrade Nextcloud version
- install and update apps
- disable maintenance mode, possibly more checks.
While that adds complexity through more containers and will increase startup time per pod, IMHO it's easier than writing complex scripts to do everything within one container. A rough sketch of that ordering follows below.
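Sketch of the proposed ordering (every name, image, host, and command is a placeholder for the idea, not a template from this chart; volumeMounts omitted for brevity):

```yaml
# Illustrative only: placeholders for the proposed init-container ordering.
initContainers:
  - name: wait-for-db
    image: busybox
    command: ["sh", "-c", "until nc -z my-db 5432; do sleep 2; done"]
  - name: upgrade-nextcloud
    image: nextcloud:27-apache
    command: ["/entrypoint.sh"]
    args: ["true"]                  # entrypoint upgrades, then exits
    env:
      - name: NEXTCLOUD_UPDATE
        value: "1"
  - name: update-apps
    image: nextcloud:27-apache
    command: ["sh", "-c", "php /var/www/html/occ app:update --all"]
  - name: disable-maintenance
    image: nextcloud:27-apache
    command: ["sh", "-c", "php /var/www/html/occ maintenance:mode --off"]
```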
I'm not entirely sure what the advantage is. Multiple init containers that run the exact same image... that also means they will have to release and acquire a lock multiple times.
I don't see what you mean by "decrease startup time per pod". At least as much work has to happen, so that doesn't seem true?
decrease startup time per pod
This was meant to say "increase", fixed now.
Every container that starts takes a few seconds until it's ready to run its scripts. So if you run more containers sequentially, you have that delay multiple times until the container can check that it does not need to do anything.
Personally, I just would have done multiple containers, because that feels easier than having to chain things in the same container. But it's not a problem for me, and I won't argue against doing it another way.