Liveness probe kills pod while upgrading
It is really unclear how you mean for Nextcloud to ever upgrade. There are no instructions (#221), and automatic upgrades probably can't work, since all replicas would race against each other and pods get killed mid-upgrade because of failing liveness checks.
How can this work?
I haven't tried a manual upgrade yet, but I'm interested in your findings, because we can then create a PR to guide others as well. If not, I'll update this issue when I do get around to testing an upgrade and share my findings as well.
In case it's helpful: you can disable the probes, but I'm not sure that's what you want. It took me a few tries to find settings that worked on my infra, at least for the initial deployment, but these are the values I currently set in my values.yaml:
## Liveness and readiness probe values
## Ref: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes
##
livenessProbe:
  enabled: true
  initialDelaySeconds: 45
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3
  successThreshold: 1
readinessProbe:
  enabled: true
  initialDelaySeconds: 45
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3
  successThreshold: 1
startupProbe:
  enabled: false
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 30
  successThreshold: 1
And if you're deploying directly to your cluster, you can edit the probe values. However, how long the upgrade takes will depend on your infrastructure, so I'm not sure which values would be best for you. You can run a helm upgrade to apply the new values, or modify the deployment directly.
Replicas can't race against each other because there is a locking mechanism implemented in the nextcloud docker image (see https://github.com/nextcloud/docker/blob/master/25/apache/entrypoint.sh#L127-L240). I never had problems with the probes on my infra, but the default values should definitely work on most hardware, so we should change them accordingly. @remram44 if you could experiment with the values until they work, we could compare them with the ones from @jessebot and take the most forgiving ones from each of you.
I can change replicas to 1 and remove the probes before I upgrade, then reset them afterwards, but it would probably be better if it worked out of the box. I don't know the consequences of an upgrade getting killed halfway through; it probably shouldn't be something that happens in normal operation.
It would probably be better to either:
- Have the probes pass while the upgrade is happening. This is a little tricky because the upgrade happens in the Docker image's entrypoint, and no server is running at that point to reply to GET requests.
- Use a Helm hook to do the upgrade outside of the deployment proper. This way you don't need probes, and you can fail the helm upgrade if the upgrade fails (see the sketch after this list).
- Use an init container to do the upgrade, which is not subject to probes.
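For the hook option, what I have in mind is a pre-upgrade Job shipped by the chart, roughly like the skeleton below. This is only a sketch: the image tag, volume and claim names are placeholders, NEXTCLOUD_UPDATE=1 is my assumption for making the entrypoint run the upgrade when the command isn't the web server, and a real version would also need the chart's database env vars.
apiVersion: batch/v1
kind: Job
metadata:
  name: nextcloud-upgrade
  annotations:
    "helm.sh/hook": pre-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: upgrade
          image: nextcloud:25.0.0-apache        # placeholder; would reuse the chart's image value
          command: ["/entrypoint.sh", "true"]   # let the entrypoint upgrade, then exit
          env:
            - name: NEXTCLOUD_UPDATE            # assumption: forces the entrypoint to run the
              value: "1"                        # install/upgrade even for a non-server command
          volumeMounts:
            - name: nextcloud-main              # placeholder; the same volume the deployment mounts
              mountPath: /var/www/html
      volumes:
        - name: nextcloud-main
          persistentVolumeClaim:
            claimName: nextcloud-nextcloud      # placeholder claim name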
About the number of replicas, there is a locking mechanism in the image (primitive -- not sure why it doesn't use flock?), but it needs to be enabled by setting NEXTCLOUD_INIT_LOCK. This should probably be set in the Helm chart.
@provokateurin NEXTCLOUD_INIT_LOCK is NOT set by default as per the README:
NEXTCLOUD_INIT_LOCK (not set by default): Set it to true to enable initialization locking. Other containers will wait for the current process to finish updating the html volume to continue.
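In the meantime, you should be able to turn it on yourself from your values.yaml via nextcloud.extraEnv (a minimal sketch; double-check that your chart version passes extraEnv through to the nextcloud container):
nextcloud:
  extraEnv:
    - name: NEXTCLOUD_INIT_LOCK
      value: "true"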
@remram44 would you like to create a PR for this change?
Env vars are created here: https://github.com/nextcloud/helm/blob/a491977458b5111cc0afd06386aa63ef46720f4c/charts/nextcloud/templates/_helpers.tpl#L65-L72
And then you'd need to update the config parameter table in our README.md, and also add the env var (with a comment about it) to values.yaml, likely around here:
https://github.com/nextcloud/helm/blob/a491977458b5111cc0afd06386aa63ef46720f4c/charts/nextcloud/values.yaml#L72-L85
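Roughly the shape I'd expect there, as a sketch (nextcloud.initLock is just an illustrative value name, not something the chart defines today; the matching default would then go into values.yaml):
# added to the env list built by _helpers.tpl
- name: NEXTCLOUD_INIT_LOCK
  value: {{ .Values.nextcloud.initLock | quote }}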
Sure, opened #343. Note that this doesn't fix this issue; we should still do something about the probe.
That makes sense, but that should probably be a second PR. I haven't worked with Helm hooks before, so I'd need to do some research, but if you or anyone else knows the fix off the top of your head, you can submit a PR and we will review it.
Note: I work on this in my spare time, and I've been on vacation, so I've been pretty busy in this repo, but I will be going back to work next week, so my response time may vary 🙏 I really appreciate your patience though. It feels great to see this repo improving :)
I think the easiest way to do this is to use an init container. The Docker image looks like it supports doing the upgrade and then exiting; you just need to pass "true" as the command. I can work on a pull request.
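Roughly the same container as in the hook sketch earlier in this thread, but declared under the deployment's initContainers so the upgrade runs (and must succeed) before the probed container ever starts. Again just a sketch; names are placeholders and NEXTCLOUD_UPDATE=1 is my assumption:
initContainers:
  - name: nextcloud-upgrade
    image: nextcloud:25.0.0-apache        # same image/tag as the main container
    command: ["/entrypoint.sh", "true"]   # entrypoint performs the upgrade, then "true" exits
    env:
      - name: NEXTCLOUD_UPDATE            # assumption, as in the hook sketch
        value: "1"
    volumeMounts:
      - name: nextcloud-main              # placeholder; the deployment's existing html volume
        mountPath: /var/www/html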
Awesome, thank you for working on this :) Please just link back to the relevant docker lines you're referencing in the PR, so others can quickly follow along and get up to speed.
Any update on this work? I also run into this issue on every docker image upgrade; I guess I need to increase the probe timeouts.
For now, disable the probes or increase their timeouts. In the meantime, I've added follow-up comments to #344 to potentially solve this in the future.
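For example, in your values.yaml (same probe blocks as shown earlier in this thread; set enabled to false, or raise timeoutSeconds/failureThreshold instead):
livenessProbe:
  enabled: false
readinessProbe:
  enabled: false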
I think a startup probe works well here, at least it seems to have worked on the latest upgrade.
startupProbe:
  enabled: true
  periodSeconds: 5
  timeoutSeconds: 5
  failureThreshold: 100
Edit: never mind, I was assuming the latest helm chart changes included a Nextcloud version change, but they didn't. I'll wait for the next upgrade to see if it actually works.
BTW: I tried the startupProbe settings above (i.e., I added them to my values file), as I also had this problem on an update, and it looks like it worked:
Normal Scheduled 3m37s default-scheduler Successfully assigned nextcloud/nextcloud-8c4f8cbdb-6mmsk to 743be977-f515-4ba0-96e7-499bfbdd5bf3
Normal Pulling 2m50s kubelet Pulling image "nextcloud:31.0.1-apache"
Normal Pulled 2m30s kubelet Successfully pulled image "nextcloud:31.0.1-apache" in 19.829s (19.829s including waiting). Image size: 465693851 bytes.
Normal Created 2m30s kubelet Created container nextcloud
Normal Started 2m29s kubelet Started container nextcloud
Normal Pulling 2m29s kubelet Pulling image "nextcloud:31.0.1-apache"
Normal Pulled 2m29s kubelet Successfully pulled image "nextcloud:31.0.1-apache" in 880ms (880ms including waiting). Image size: 465693851 bytes.
Normal Created 2m28s kubelet Created container nextcloud-cron
Normal Started 2m28s kubelet Started container nextcloud-cron
Warning Unhealthy 80s (x8 over 115s) kubelet Startup probe failed: Get "http://10.42.0.202:80/status.php": dial tcp 10.42.0.202:80: connect: connection refused
The new version of Nextcloud was running fine.