
Liveness probe kills pod while upgrading

Open remram44 opened this issue 2 years ago • 13 comments

It is really unclear how you intend for Nextcloud to ever upgrade. There are no instructions (#221), and an automatic upgrade probably can't work: all replicas would race against each other, and pods get killed mid-upgrade because of failing liveness checks.

How can this work?

remram44 avatar Jan 19 '23 20:01 remram44

I haven't tried a manual upgrade yet, but I'm interested in your findings, because we could then create a PR to guide others as well. Otherwise, I'll update this issue when I get around to testing an upgrade and share my findings.

In case it's helpful: you can disable the probes, but I'm not sure that's what you want. It took me a few tries to find settings that worked on my infra for at least the initial deployment; these are the values I currently use in my values.yaml:

## Liveness and readiness probe values
## Ref: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes
##
livenessProbe:
  enabled: true
  initialDelaySeconds: 45
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3
  successThreshold: 1

readinessProbe:
  enabled: true
  initialDelaySeconds: 45
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3
  successThreshold: 1

startupProbe:
  enabled: false
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 30
  successThreshold: 1
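For a sense of scale, the time windows these probe settings allow can be worked out with the standard formula (initial delay plus one period per allowed failure; this is a rough sketch, since kubelet probe timing is only approximate):

```python
def worst_case_seconds(initial_delay, period, failure_threshold):
    """Longest a container can stay unresponsive before the probe is
    considered failed: the initial delay plus one period per allowed
    failure."""
    return initial_delay + period * failure_threshold

# livenessProbe above: the pod is restarted if still unhealthy after this
print(worst_case_seconds(45, 15, 3))   # 90 seconds

# startupProbe above (if it were enabled): liveness/readiness are held off
# until it succeeds, so this is the grace window for slow starts
print(worst_case_seconds(30, 10, 30))  # 330 seconds
```

If a Nextcloud upgrade takes longer than the liveness window, the kubelet kills the pod mid-upgrade, which is exactly the failure mode described in this issue.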

And if you're deploying directly to your cluster, you can edit the probe values there as well. The time an upgrade takes depends on your infrastructure, though, so I'm not sure which values would be best for you. You could do a helm upgrade to apply the new values, or modify the deployment directly.

jessebot avatar Jan 24 '23 15:01 jessebot

Replicas can't race against each other, because there is a locking mechanism implemented in the nextcloud docker image (see https://github.com/nextcloud/docker/blob/master/25/apache/entrypoint.sh#L127-L240). I never had problems with the probes on my infra, but the default values should work on most hardware, so we should change them accordingly. @remram44, if you could experiment with the values until they work, we could compare them with the ones from @jessebot and take the most forgiving ones from each of you.

provokateurin avatar Jan 24 '23 15:01 provokateurin

I can change replicas to 1 and remove the probes before I upgrade, then reset them, but it would be better if this worked out of the box. I don't know the consequences of an upgrade getting killed half-way; it probably shouldn't be something that happens in normal operation.

It would probably be better to either:

  • Have the probes pass while the upgrade is happening. This is a little tricky, because the upgrade happens in the Docker image's entrypoint, and no server is running at that point to reply to GET requests
  • Use a Helm hook to do the upgrade outside of the deployment proper. This way you don't need probes, and you can fail the helm upgrade if the Nextcloud upgrade fails
  • Use an init container to do the upgrade, which is not subject to probes

About the number of replicas, there is a locking mechanism in the image (primitive -- not sure why it doesn't use flock?), but it needs to be enabled by setting NEXTCLOUD_INIT_LOCK. This should probably be set in the Helm chart.
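Until the chart sets this by default, a possible workaround sketch in values.yaml (untested; assumes the chart exposes an extraEnv list under the nextcloud key):

```yaml
nextcloud:
  extraEnv:
    # Enable the image's init locking so replicas wait for each other
    # instead of racing while updating the html volume.
    - name: NEXTCLOUD_INIT_LOCK
      value: "true"
```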

remram44 avatar Jan 24 '23 15:01 remram44

@provokateurin NEXTCLOUD_INIT_LOCK is NOT set by default as per the README:

NEXTCLOUD_INIT_LOCK (not set by default) Set it to true to enable initialization locking. Other containers will wait for the current process to finish updating the html volume to continue.

remram44 avatar Jan 24 '23 15:01 remram44

About the number of replicas, there is a locking mechanism in the image (primitive -- not sure why it doesn't use flock?), but it needs to be enabled by setting NEXTCLOUD_INIT_LOCK. This should probably be set in the Helm chart.

@remram44 would you like to create a PR for this change?

Env vars are created here: https://github.com/nextcloud/helm/blob/a491977458b5111cc0afd06386aa63ef46720f4c/charts/nextcloud/templates/_helpers.tpl#L65-L72

And then you'd need to update the config parameter table in our README.md as well as values.yaml, likely around here, adding the env var with a comment about it:

https://github.com/nextcloud/helm/blob/a491977458b5111cc0afd06386aa63ef46720f4c/charts/nextcloud/values.yaml#L72-L85

jessebot avatar Jan 26 '23 23:01 jessebot

Sure, opened #343. Note that this doesn't fix this issue; we should still do something about the probes.

remram44 avatar Jan 27 '23 03:01 remram44

That makes sense, but that should probably be a second PR. I haven't worked with Helm hooks before, so I'd need to do some research, but if you or anyone else knows the fix off the top of your head, you can submit a PR and we will review it.

Note: I work on this in my spare time, and since I've been on vacation, I've been able to be pretty active in this repo; I'll be going back to work next week, though, so my response time may vary 🙏 I really appreciate your patience. It feels great to see this repo improving :)

jessebot avatar Jan 27 '23 08:01 jessebot

I think the easiest way to do this is with an init container. The Docker image looks like it supports doing the upgrade and then exiting; you just need to pass "true" as the command. I can work on a pull request.
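A rough, untested sketch of that idea, assuming the image's entrypoint runs the install/upgrade logic when NEXTCLOUD_UPDATE=1 even though the command is not the web server, and using a hypothetical volume name (reuse whatever html volume the main container actually mounts):

```yaml
initContainers:
  - name: nextcloud-upgrade
    image: nextcloud:25-apache      # must match the main container's image
    # With NEXTCLOUD_UPDATE=1 the entrypoint performs the upgrade, then
    # runs "true" and exits 0; init containers are not subject to probes.
    command: ["/entrypoint.sh", "true"]
    env:
      - name: NEXTCLOUD_UPDATE
        value: "1"
    volumeMounts:
      - name: nextcloud-main        # hypothetical; use the deployment's volume
        mountPath: /var/www/html
```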

remram44 avatar Jan 27 '23 16:01 remram44

Awesome, thank you for working on this :) Please just link back to the relevant docker lines you're referencing in the PR, so others can quickly follow along and get up to speed.

jessebot avatar Jan 27 '23 16:01 jessebot

Any update on this work? I also run into this issue in every docker image upgrade, I guess I need to increase probe timeouts

davidfrickert avatar Jun 06 '24 23:06 davidfrickert

Any update on this work? I also run into this issue in every docker image upgrade, I guess I need to increase probe timeouts

For now, disable the probes or increase the timeout on them. In the meantime, I've added follow up comments to #344 to potentially solve this in the future.
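Concretely, in values.yaml that could look like either of these (the numbers are placeholders to tune for your infrastructure):

```yaml
# Option A: disable the probes entirely
livenessProbe:
  enabled: false
readinessProbe:
  enabled: false

# Option B: keep the probes but allow much more time
livenessProbe:
  enabled: true
  initialDelaySeconds: 120
  periodSeconds: 15
  failureThreshold: 10
```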

jessebot avatar Jun 09 '24 07:06 jessebot

Any update on this work? I also run into this issue in every docker image upgrade, I guess I need to increase probe timeouts

For now, disable the probes or increase the timeout on them. In the meantime, I've added follow up comments to #344 to potentially solve this in the future.

I think a startup probe works well here, at least it seems to have worked on the latest upgrade.

startupProbe:
  enabled: true
  periodSeconds: 5
  timeoutSeconds: 5
  failureThreshold: 100

Edit: never mind; I assumed the latest Helm chart changes included a Nextcloud version bump, but they didn't. I'll wait for the next upgrade to see if it actually works.
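For reference, the startup probe above tolerates up to periodSeconds × failureThreshold seconds of failures before the kubelet gives up, and the liveness/readiness probes don't start until it succeeds, so this allows a fairly long upgrade:

```python
# Maximum startup window allowed by the probe values above
period_seconds = 5
failure_threshold = 100
print(period_seconds * failure_threshold)  # 500 seconds (~8 minutes)
```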

davidfrickert avatar Jun 09 '24 13:06 davidfrickert

I think a startup probe works well here, at least it seems to have worked on the latest upgrade.

startupProbe:
  enabled: true
  periodSeconds: 5
  timeoutSeconds: 5
  failureThreshold: 100

Edit: never mind; I assumed the latest Helm chart changes included a Nextcloud version bump, but they didn't. I'll wait for the next upgrade to see if it actually works.

BTW: I tried this (as in: I added it to my values file), as I also had this problem on update, and it looks like it worked:

  Normal   Scheduled  3m37s               default-scheduler  Successfully assigned nextcloud/nextcloud-8c4f8cbdb-6mmsk to 743be977-f515-4ba0-96e7-499bfbdd5bf3
  Normal   Pulling    2m50s               kubelet            Pulling image "nextcloud:31.0.1-apache"
  Normal   Pulled     2m30s               kubelet            Successfully pulled image "nextcloud:31.0.1-apache" in 19.829s (19.829s including waiting). Image size: 465693851 bytes.
  Normal   Created    2m30s               kubelet            Created container nextcloud
  Normal   Started    2m29s               kubelet            Started container nextcloud
  Normal   Pulling    2m29s               kubelet            Pulling image "nextcloud:31.0.1-apache"
  Normal   Pulled     2m29s               kubelet            Successfully pulled image "nextcloud:31.0.1-apache" in 880ms (880ms including waiting). Image size: 465693851 bytes.
  Normal   Created    2m28s               kubelet            Created container nextcloud-cron
  Normal   Started    2m28s               kubelet            Started container nextcloud-cron
  Warning  Unhealthy  80s (x8 over 115s)  kubelet            Startup probe failed: Get "http://10.42.0.202:80/status.php": dial tcp 10.42.0.202:80: connect: connection refused

New version of Nextcloud was running fine.

Frankst2 avatar Mar 15 '25 21:03 Frankst2