Improve liveness health check for NATS JetStream

Open mfaizanse opened this issue 2 years ago • 1 comments

As a follow up of ticket.

Pre-requisites:

  • https://github.com/kyma-project/kyma/issues/15096

Tasks:

  • [ ] [Suggestion 1] Change the liveness check from the / endpoint to /healthz, because /healthz internally also performs health checks for the JetStream server, streams and consumers.

    • [ ] Once this PR is released in a new version, we will be able to configure the behaviour of /healthz:
      • /healthz?js-enabled=true will return non-healthy status if JetStream is disabled on that instance.
      • /healthz?js-enabled=true&js-server-only=true will only check JetStream server but not the streams and consumers.
  • [ ] [Alternative to Suggestion 1, if /healthz is still not reliable] Add a sidecar health check container to the NATS Pods that continuously queries /jsz and /healthz and checks in depth whether the NATS instance is healthy or should be restarted. We can use the liveness check on this container.
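
The /healthz variants listed above could be exercised with a small probe helper like the one below. This is only a minimal sketch, not the Kyma implementation: the function names are hypothetical, and the query parameter names are copied verbatim from this issue, so they may differ in the NATS server version that is eventually released.

```python
import urllib.request
from urllib.parse import urlencode


def healthz_url(base_url, js_enabled=False, js_server_only=False):
    """Build the /healthz URL with the optional JetStream flags from this issue."""
    params = {}
    if js_enabled:
        params["js-enabled"] = "true"
    if js_server_only:
        params["js-server-only"] = "true"
    url = base_url.rstrip("/") + "/healthz"
    return url + "?" + urlencode(params) if params else url


def is_healthy(url, timeout=2.0):
    """Return True only if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection errors, timeouts and non-2xx responses (HTTPError).
        return False
```

A sidecar could call, for example, `is_healthy(healthz_url(base, js_enabled=True, js_server_only=True))` in a loop, where `base` is the address of the NATS monitoring endpoint (an assumption here, since the issue does not name it).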

mfaizanse avatar Aug 09 '22 08:08 mfaizanse

TODO: Check when the next release of NATS server will be. If it's too late, then we can go with suggestion 2.

mfaizanse avatar Aug 12 '22 12:08 mfaizanse

Also look into the readiness checks!

raypinto avatar Aug 26 '22 11:08 raypinto

This issue or PR has been automatically marked as stale due to the lack of recent activity. Thank you for your contributions.

This bot triages issues and PRs according to the following rules:

  • After 60d of inactivity, lifecycle/stale is applied
  • After 7d of inactivity since lifecycle/stale was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Close this issue or PR with /close

If you think that I work incorrectly, kindly raise an issue with the problem.

/lifecycle stale

kyma-bot avatar Nov 15 '22 12:11 kyma-bot

This PR introduces a fix for the liveness/readiness probes.

marcobebway avatar Dec 20 '22 12:12 marcobebway

Wouldn’t it be good to also check the streams and consumers in the liveness check by using /healthz?

  • JetStream can, for example, be recovering, which can take some time. That is still a properly running server. Killing it could restart / interrupt the whole process.
  • Also, during normal operation it is very often in the “recovering” state, simply because things fall behind under load etc., while still being fully healthy.
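
The reasoning above suggests keeping the restart decision tied to the server itself, not to streams and consumers. A minimal sketch of that split is shown below. The names `ProbeResult` and `evaluate` are hypothetical, and routing the stream/consumer result to readiness instead of liveness is an assumption for illustration, not a decision recorded in this thread.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    live: bool   # False means the container should be restarted
    ready: bool  # False means the Pod should not receive traffic


def evaluate(server_ok: bool, streams_ok: bool) -> ProbeResult:
    """Derive probe outcomes from a server-only check and a full check."""
    # Liveness depends only on the server-level check, so a JetStream that is
    # still recovering streams or consumers is not killed mid-recovery.
    return ProbeResult(live=server_ok, ready=server_ok and streams_ok)
```

With this split, a recovering instance evaluates to live but not ready, so it is left running until recovery completes.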

mfaizanse avatar Jan 17 '23 08:01 mfaizanse

  • https://github.com/nats-io/k8s/issues/594
  • https://github.com/nats-io/k8s/issues/622

mfaizanse avatar Jan 19 '23 08:01 mfaizanse

Questions:

  • Is downtime acceptable during StatefulSet recreation?
  • Would the PVCs remain, with no data loss?

mfaizanse avatar Jan 19 '23 10:01 mfaizanse

Next TODOs:

  • [x] Bring back eventing reconciler.
  • [x] Add a pre-action to remove the NATS StatefulSet if it has podManagementPolicy = OrderedReady, so that a new StatefulSet is created by the NATS Helm chart.
  • [x] Make sure the PVCs are not deleted during the upgrade process. No data loss.
  • [x] Test the upgrade to this PR.
  • [ ] Revert PR: https://github.com/kyma-project/control-plane/pull/2332 and bump image for eventing reconciler.
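
The check behind the pre-action could look roughly like the sketch below. It only covers the decision logic on a StatefulSet manifest loaded as a plain dict; the function name is hypothetical, and the actual deletion and the PVC handling in the reconciler are not shown. An unset field is treated as OrderedReady because that is the Kubernetes default.

```python
def needs_recreation(statefulset: dict) -> bool:
    """Return True if the StatefulSet still uses podManagementPolicy OrderedReady."""
    # Kubernetes defaults podManagementPolicy to OrderedReady when it is unset.
    policy = statefulset.get("spec", {}).get("podManagementPolicy") or "OrderedReady"
    return policy == "OrderedReady"
```

For example, `needs_recreation({"spec": {"podManagementPolicy": "Parallel"}})` returns False, so the pre-action would leave such a StatefulSet untouched.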

mfaizanse avatar Jan 23 '23 12:01 mfaizanse

Old PRs:

  • https://github.com/kyma-incubator/reconciler/pull/1151
  • https://github.com/kyma-project/control-plane/pull/2332

mfaizanse avatar Jan 24 '23 09:01 mfaizanse

Waiting for the new kyma-cli release with the eventing controller, so that the upgrade job for the last PR succeeds.

mfaizanse avatar Feb 06 '23 15:02 mfaizanse