Improve liveness health check for NATS JetStream
As a follow-up of the ticket below.
Pre-requisites:
- https://github.com/kyma-project/kyma/issues/15096
Tasks:
- [ ] [Suggestion 1] Change the liveness check from the `/` endpoint to `/healthz`, because `/healthz` internally also does some health checks for the JetStream server, streams and consumers (see the probe sketch after this list).
  - [ ] Once this PR is released in a new version, it would allow us to configure the behaviour of `/healthz`:
    - `/healthz?js-enabled=true` will return a non-healthy status if JetStream is disabled on that instance.
    - `/healthz?js-enabled=true&js-server-only=true` will only check the JetStream server, but not the streams and consumers.
- [ ] [Alternative to Suggestion 1, if `/healthz` is still not reliable] Have a sidecar health-check container in the NATS Pods, which continuously queries `/jsz` and `/healthz` and checks in depth whether the NATS instance is healthy or should be restarted. We can use the liveness check on this container.
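For Suggestion 1, a minimal liveness-probe sketch, assuming the default NATS monitoring port 8222 and a container named `nats`; the container name and the timing values are placeholders, and in practice the probe would be set through the NATS Helm chart values rather than a raw manifest:

```yaml
# Sketch only: container name, port name and timings are assumptions.
containers:
  - name: nats
    ports:
      - name: monitor
        containerPort: 8222              # NATS HTTP monitoring port
    livenessProbe:
      httpGet:
        path: /healthz?js-enabled=true   # instead of the current "/" endpoint
        port: monitor
      initialDelaySeconds: 10
      periodSeconds: 30
      timeoutSeconds: 5
      failureThreshold: 3
```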
TODO: Check when the next release of the NATS server will be. If it is too late, then we can go with Suggestion 2 (the sidecar alternative).
Also look into the readiness checks!
This issue or PR has been automatically marked as stale due to the lack of recent activity. Thank you for your contributions.
This bot triages issues and PRs according to the following rules:
- After 60d of inactivity, `lifecycle/stale` is applied
- After 7d of inactivity since `lifecycle/stale` was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Close this issue or PR with `/close`
If you think that I work incorrectly, kindly raise an issue with the problem.
/lifecycle stale
This PR introduces a fix for the liveness/readiness probes.
Wouldn’t it be good if we also check the streams and consumers in the liveness check by using `/healthz`?
- JetStream can, for example, be recovering, which can take some time. That is still a properly running server. Killing it could restart or interrupt the whole recovery process.
- Also, during normal operation it is very often in a “recovering” state, simply because things fall behind under load etc., while still being fully healthy (see the probe sketch after the linked issues below).
- https://github.com/nats-io/k8s/issues/594
- https://github.com/nats-io/k8s/issues/622
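Given the recovery concern above, one option (an assumption for illustration, not something decided in this thread) is to keep the liveness check limited to the JetStream server via `js-server-only=true`, so a recovering instance is not killed, and to look at the deeper stream/consumer checks on the readiness side instead, as the “Also look into the readiness checks!” note suggests. A sketch with the same placeholder port name and timings as above:

```yaml
# Sketch only: keep liveness shallow so a recovering JetStream is not restarted.
livenessProbe:
  httpGet:
    path: /healthz?js-enabled=true&js-server-only=true   # JetStream server only
    port: monitor
  periodSeconds: 30
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /healthz            # deeper check, incl. streams and consumers
    port: monitor
  periodSeconds: 10
  failureThreshold: 3
```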
Questions:
- Is downtime okay with statefulset recreation?
- Would PVCs will remain and no data loss.
Next TODOs:
- [x] Bring back eventing reconciler.
- [x] Add a pre-action to remove the NATS StatefulSet if it has `podManagementPolicy = OrderedReady`, so that a new StatefulSet is created by the NATS Helm chart (see the fragment after this list).
- [x] Make sure the PVCs are not deleted during the upgrade process. No data loss.
- [x] Test the upgrade to this PR.
- [ ] Revert PR: https://github.com/kyma-project/control-plane/pull/2332 and bump image for eventing reconciler.
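For context on the pre-action and the PVC question: `podManagementPolicy` is immutable on an existing StatefulSet, so switching it requires deleting the old StatefulSet and letting the Helm chart create a new one, while the PVCs created from `volumeClaimTemplates` are not removed when the StatefulSet itself is deleted. A fragment showing only the relevant fields, with names and values as assumptions:

```yaml
# Fragment only (not a complete StatefulSet); names, replica count and storage size are assumptions.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: eventing-nats
spec:
  podManagementPolicy: Parallel     # immutable: changing it means delete + recreate
  serviceName: eventing-nats
  replicas: 3
  volumeClaimTemplates:             # PVCs created from here survive StatefulSet deletion
    - metadata:
        name: eventing-nats-js-pvc
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
```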
Old PRs:
- https://github.com/kyma-incubator/reconciler/pull/1151
- https://github.com/kyma-project/control-plane/pull/2332
Waiting for the new `kyma-cli` release with the eventing controller, so that the upgrade job for the last PR succeeds.