helm icon indicating copy to clipboard operation
helm copied to clipboard

Avoid upgrade being killed by failed liveness probes

Open remram44 opened this issue 2 years ago • 45 comments

Pull Request

Description of the change

This runs the upgrade in an init container, before the main container starts. It sets the command arguments to "true", so the container exits immediately after the upgrade, and sets the variable NEXTCLOUD_UPDATE to 1, without which the upgrade step is skipped because there are command arguments (see entrypoint).

Benefits

Avoid the container being stopped during the upgrade because of failed liveness probes, since init containers don't get probed.

Possible drawbacks

Runs an additional container, so it's a bit slower I guess.

Applicable issues

  • fixes #333

Additional information

This also removes the nextcloud.update value. I don't see a way people could possibly have used it though, since it only does something when you pass different arguments to the image, and there is no value in this chart that will allow you to do that.

Checklist

  • [x] DCO has been signed off on the commit.
  • [x] Chart version bumped in Chart.yaml according to semver.
  • ~~(optional) Variables are documented in the README.md~~

remram44 avatar Jan 27 '23 17:01 remram44

Thanks for submitting a PR! :)

Will this always run an upgrade anytime the pod restarts or additional are spun up? Is there a way to disable upgrades until a user is ready?

jessebot avatar Jan 27 '23 17:01 jessebot

It will update the volume to the version of the image, so it's usually triggered by the user doing helm upgrade (or kubectl set image I guess). If a Pod is restarted without changing version, nothing will happen.

remram44 avatar Jan 27 '23 17:01 remram44

This doesn't solve the situation where you need to run the web installer. Probes should probably be changed not to fail if the web installer needs to be run (otherwise the pod never becomes ready because not installed, and you can't reach the web installer to install)

remram44 avatar Feb 09 '23 18:02 remram44

This doesn't solve the situation where you need to run the web installer. Probes should probably be changed not to fail if the web installer needs to be run (otherwise the pod never becomes ready because not installed, and you can't reach the web installer to install)

Should you even be using the web based upgrader at all when using container images? The image has the updated files. If you use the web-based upgrader but don't update your container image version, you'll break stuff for sure.

Jeroen0494 avatar Mar 04 '23 11:03 Jeroen0494

You can disable the web-based updater to prevent people from borking their installation:

'upgrade.disable-web' => true,

https://docs.nextcloud.com/server/stable/admin_manual/configuration_server/config_sample_php_parameters.html Search for the upgrade.disable-web option.

Jeroen0494 avatar Mar 04 '23 11:03 Jeroen0494

Yeah I also think it should auto-install but that is not the default behavior of running helm install

remram44 avatar Mar 04 '23 16:03 remram44

Hey, what would be needed to get this merged? Also, is it possible to backport this change to the last 3 versions, so 25, 26 and 27?

4censord avatar Oct 26 '23 13:10 4censord

@4censord I haven't looked at this PR closely, but the conflicts need to be solved and the changes need to be reviewed. With this chart there is no backporting, you just overrride nextcloud.tag with your desired version.

provokateurin avatar Oct 26 '23 13:10 provokateurin

I'll take a look at resolving the conflicts.

With this chart there is no backporting, you just overrride nextcloud.tag with your desired version.

Ok

4censord avatar Oct 26 '23 13:10 4censord

The conflicts are solvable by simply rebasing onto the main branch. @remram44 can you rebase, please? Or if you'd be fine with that, i could open a new PR with your changes, and we close this one.

4censord avatar Oct 26 '23 14:10 4censord

If I rebase, will somebody review? I don't like to put in work for it to be ignored

remram44 avatar Oct 26 '23 14:10 remram44

@jessebot can you please take a look at this? this is a longstanding issue with this chart and currently the only way to prevent upgrade failing is to disable all probes which is not really a solution.

I'm using fluxcd gitops tool to manage my cluster, and when the helm upgrade fails because it's taking longer than the probes, fluxcd will rollback the pod to the last working state. But nextcloud doesn't want to start the last working state too because of this error:

Can't start Nextcloud because the version of the data (27.1.3.2) is higher than the docker image version (27.1.2.1) and downgrading is not supported.
Are you sure you have pulled the newest image version?

And I will be left with the pod crashlooping forever until I manually fix it using occ commands.

budimanjojo avatar Oct 29 '23 06:10 budimanjojo

And I will be left with the pod crashlooping forever until I manually fix it using occ commands.

How exactly are you fixing this? I am left on 25.0.1 currently, and have yet to actually complete an upgrade with the helm deployment.

4censord avatar Oct 29 '23 17:10 4censord

@4censord I use helm rollback nextcloud <revision> to get it back to the helm revision where the upgrade failed and get rid of the nextcloud error. Then exec into the pod and do php occ upgrade then php occ maintenance:mode --off

budimanjojo avatar Oct 30 '23 01:10 budimanjojo

@budimanjojo Then i seem to have a different issue, that does not work for me. I have to roll back a backup after attempting an upgrade.

4censord avatar Oct 30 '23 19:10 4censord

@provokateurin Would you mind taking a look at this once it's convenient? IMO its ready.

4censord avatar Nov 05 '23 23:11 4censord

Sorry for the delay. Slowly making my rounds 🙏

@remram44 I did a quick look over and it seems ok, but don't we still want to be able to set upgrade in the values.yaml?

Also, there still needs to be a rebase, as there's conflicts.

jessebot avatar Jun 09 '24 07:06 jessebot

Also, tagged @provokateurin to get an additional review to ask about keeping the upgrade parameter in the values.yaml.

jessebot avatar Jun 09 '24 07:06 jessebot

I'm not 100% sure if I read https://github.com/nextcloud/docker/blob/master/README.md correctly, but it sounds like the "normal" container will still try to do the upgrade as the default command is used there. Then we have a race condition between the two containers and depending on which one gets to work first the upgrade is killed by the probes or not.

provokateurin avatar Jun 09 '24 07:06 provokateurin

IIRK the upgrade env var is used for triggering the Nextcloud images upgrade mechanism. But because this PR completely supersedes the internal upgrade on startup stuff by running the upgrades explicitly before, the upgrade env var does not do anything any more.

Also, there still needs to be a rebase, as there's conflicts.

Yeah, back in 2023 when we said its ready there weren't.

4censord avatar Jun 09 '24 07:06 4censord

Then we have a race condition between the two containers and depending on which one gets to work first the upgrade is killed by the probes or not.

No, the init container runs first, and the normal container only gets started once the upgrade is completes. If it had the upgrade var set, it just would not do anything because its already on the lasted version

4censord avatar Jun 09 '24 07:06 4censord

You're right, I overlooked that it is an initContainer and not another sidecar. Then this makes sense to me, I'll have to give it some testing though to confirm it works as intended.

provokateurin avatar Jun 09 '24 07:06 provokateurin

If you could do a rebase onto the latest main state then I'll give it a test.

provokateurin avatar Jun 09 '24 08:06 provokateurin

We will have to wait for @remram44, it's their PR. I just have an interest in getting this merged because i run into this problem every Nextcloud upgrade.

4censord avatar Jun 09 '24 09:06 4censord

Ah yeah I didn't look who submitted the PR and thought it was you :see_no_evil:

provokateurin avatar Jun 09 '24 09:06 provokateurin

I rebased.

I am not sure how this interacts with #466/#525, I moved running the hook to the initContainer only. @wrenix should probably weigh in.

remram44 avatar Jun 09 '24 17:06 remram44

In my opinion with the hooks. It is a better solution to move it to the initContainer (that should work).

wrenix avatar Jun 09 '24 21:06 wrenix

I'd say we should have multiple init-containers run in order. That would make them both easier, and would help with debugging. I'd have them in this order:

  1. wait for DB to be up
  2. Upgrade Nextcloud version
  3. install and update apps
  4. disable maintenance mode, possible more checks.

While that adds complexity in more containers, and will increase startup time per pod, IMHO it's then easier that writing complex scripts for doing everything within one container.

4censord avatar Jun 14 '24 22:06 4censord

I'm not entirely sure what the advantage is. Multiple init containers that run the exact same image... that also means they will have to release/acquire a lock multiple time.

I don't see what you mean be "decrease startup time per pod". At least as much work has to happen, so that doesn't seem true?

remram44 avatar Jun 14 '24 23:06 remram44

decrease startup time per pod

This was meant to say "increase", fixed now.

Every container that starts takes a few seconds until it's ready to run its scripts. So if you run more containers sequentially, you have that delay multiple times until the container can check that it does not need to do anything.

Personally, I just would have done multiple containers, because that feels easier to do, rather than having to chain things in the same container. But its not a problem for me, and i wont argue against doing it another way.

4censord avatar Jun 15 '24 11:06 4censord