Improve liveness health check for NATS JetStream
As a follow-up of the ticket below.
Pre-requisites:
- https://github.com/kyma-project/kyma/issues/15096
Tasks:
- [ ] [Suggestion 1] Change the liveness check from the `/` endpoint to `/healthz`, because `/healthz` internally also does some health checks for the JetStream server, streams and consumers (see the probe sketch after this list).
  - [ ] Once this PR is released in a new version, it would allow us to configure the behaviour of `/healthz`:
    - `/healthz?js-enabled=true` will return a non-healthy status if JetStream is disabled on that instance.
    - `/healthz?js-enabled=true&js-server-only=true` will only check the JetStream server, but not the streams and consumers.
- [ ] [Alternative to Suggestion 1, if `/healthz` is still not reliable] Have a sidecar health-check container in the NATS Pods, which continuously queries `/jsz` and `/healthz` and checks in depth whether the NATS instance is healthy or should be restarted. We can use the liveness check on this container.
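For Suggestion 1, a minimal liveness-probe sketch, assuming the default NATS monitoring port 8222 and a container named `nats`; the container name and the timing values are placeholders, and in practice the probe would be set through the NATS Helm chart values rather than a raw manifest:

```yaml
# Sketch only: container name, port name and timings are assumptions.
containers:
  - name: nats
    ports:
      - name: monitor
        containerPort: 8222              # NATS HTTP monitoring port
    livenessProbe:
      httpGet:
        path: /healthz?js-enabled=true   # instead of the current "/" endpoint
        port: monitor
      initialDelaySeconds: 10
      periodSeconds: 30
      timeoutSeconds: 5
      failureThreshold: 3
```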
TODO: Check when the next release of the NATS server will be. If it is too late, then we can go with Suggestion 2 (the sidecar alternative).
Also look into the readiness checks!
This issue or PR has been automatically marked as stale due to the lack of recent activity. Thank you for your contributions.
This bot triages issues and PRs according to the following rules:
- After 60d of inactivity, `lifecycle/stale` is applied
- After 7d of inactivity since `lifecycle/stale` was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Close this issue or PR with `/close`
If you think that I work incorrectly, kindly raise an issue with the problem.
/lifecycle stale
This PR introduces a fix for the liveness/readiness probes.
Wouldn’t it be good if we also check the streams and consumers in the liveness check by using `/healthz`?
- JetStream can, for example, be recovering, which can take some time. That is still a properly running server. Killing it could restart or interrupt the whole recovery process.
- Also, during normal operation it is very often in a “recovering” state, simply because things fall behind under load etc., while still being fully healthy (see the probe sketch after the linked issues below).
- https://github.com/nats-io/k8s/issues/594
- https://github.com/nats-io/k8s/issues/622
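Given the recovery concern above, one option (an assumption for illustration, not something decided in this thread) is to keep the liveness check limited to the JetStream server via `js-server-only=true`, so a recovering instance is not killed, and to look at the deeper stream/consumer checks on the readiness side instead, as the “Also look into the readiness checks!” note suggests. A sketch with the same placeholder port name and timings as above:

```yaml
# Sketch only: keep liveness shallow so a recovering JetStream is not restarted.
livenessProbe:
  httpGet:
    path: /healthz?js-enabled=true&js-server-only=true   # JetStream server only
    port: monitor
  periodSeconds: 30
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /healthz            # deeper check, incl. streams and consumers
    port: monitor
  periodSeconds: 10
  failureThreshold: 3
```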
Questions:
- Is downtime okay with statefulset recreation?
- Would PVCs will remain and no data loss.
Next TODOs:
- [x] Bring back eventing reconciler.
- [x] Add a pre-action to remove the NATS StatefulSet if it has `podManagementPolicy = OrderedReady`, so that a new StatefulSet is created by the NATS Helm chart (see the fragment after this list).
- [x] Make sure the PVCs are not deleted during the upgrade process. No data loss.
- [x] Test the upgrade to this PR.
- [ ] Revert PR: https://github.com/kyma-project/control-plane/pull/2332 and bump image for eventing reconciler.
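For context on the pre-action and the PVC question: `podManagementPolicy` is immutable on an existing StatefulSet, so switching it requires deleting the old StatefulSet and letting the Helm chart create a new one, while the PVCs created from `volumeClaimTemplates` are not removed when the StatefulSet itself is deleted. A fragment showing only the relevant fields, with names and values as assumptions:

```yaml
# Fragment only (not a complete StatefulSet); names, replica count and storage size are assumptions.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: eventing-nats
spec:
  podManagementPolicy: Parallel     # immutable: changing it means delete + recreate
  serviceName: eventing-nats
  replicas: 3
  volumeClaimTemplates:             # PVCs created from here survive StatefulSet deletion
    - metadata:
        name: eventing-nats-js-pvc
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
```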
Old PRs:
- https://github.com/kyma-incubator/reconciler/pull/1151
- https://github.com/kyma-project/control-plane/pull/2332
Waiting for the new `kyma-cli` release with the eventing controller, so that the upgrade job for the last PR succeeds.