helm: Make liveness check JS aware

Open wallyqs opened this issue 4 years ago • 4 comments

When JetStream (JS) has been explicitly enabled, the liveness check should take into account whether it is actually still enabled on the server.

wallyqs avatar Sep 21 '21 16:09 wallyqs

This was accomplished by the Startup Probe, right? Can we close?

caleblloyd avatar Mar 09 '22 15:03 caleblloyd

The JetStream engine can shut down for various reasons while the server is running. For example, if it fails to write to disk, JetStream will shut down but leave the server running to serve any other non-JetStream traffic.

I think the idea for a JS liveness check here would be to detect whether this has happened, which would be reflected in https://demo.nats.io:8222/jsz showing disabled as below, and if so, make k8s restart the server via the liveness check.

curl http://localhost:8222/jsz
{
  "server_id": "NA7OQBKW534L6M526NKMPABJPKULXYHVTXZSZYKFK52FJGVXOTD3SHBE",
  "now": "2022-03-09T17:13:26.593755Z",
  "disabled": true,
  "config": {
    "max_memory": 0,
    "max_storage": 0
  }
}
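
As a rough illustration of the check described above, here is a minimal Python sketch that reads the `/jsz` monitoring endpoint and maps the `disabled` field to an exit code that an exec-style probe could use. The function names `jetstream_disabled` and `probe` are hypothetical, and treating a missing `disabled` key as "JetStream is running" is an assumption, not documented server behavior.

```python
import json
import urllib.request


def jetstream_disabled(jsz_body: str) -> bool:
    """Return True if a /jsz response body reports JetStream as disabled."""
    report = json.loads(jsz_body)
    # Assumption: a missing "disabled" key means JetStream is running.
    return bool(report.get("disabled", False))


def probe(url: str = "http://localhost:8222/jsz", timeout: float = 2.0) -> int:
    """Return 0 if JetStream looks healthy, 1 otherwise (exec-probe style)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8")
        return 1 if jetstream_disabled(body) else 0
    except (OSError, ValueError):
        # Endpoint unreachable, or the body was not valid JSON:
        # report unhealthy rather than guess.
        return 1
```

A wrapper script could call `sys.exit(probe())` so that a non-zero exit marks the container as failing. Whether restarting is the right reaction is a separate question.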

An alternative, instead of doing this via a k8s health check, might be to introduce a server option to "shutdown on JetStream becoming disabled due to an error", or more advanced behaviors such as disconnecting the consumers that are using JetStream when that happens.

wallyqs avatar Mar 09 '22 17:03 wallyqs

Killing a partially working service with the liveness probe can be a double-edged sword. Take the disk filling up as an example:

Pro: The pod will enter CrashLoopBackOff, and a cluster administrator who has alerting set up may be alerted that there is a problem with that particular pod.

Con: Non-JetStream functionality will still be working on this pod, and restarting it with a liveness probe will not fix the underlying problem.

caleblloyd avatar Mar 09 '22 17:03 caleblloyd

Agreed, it depends on the state the system is in. There used to be a bug in the server where the JS service would shut itself down, for example after attempting a write into a missing directory, and in that case a restart would have helped the system recover. Most of those issues have been resolved since then.

wallyqs avatar Mar 09 '22 18:03 wallyqs

It would be nice to make it configurable in some way. That way one can decide what one wants, deployment by deployment.

svallebro avatar Jan 21 '23 19:01 svallebro

This works now in the healthz-based liveness probe.
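
For readers who want to see what a healthz-based check might look like, here is a small Python sketch. It assumes the server's `/healthz` monitoring endpoint accepts a `js-enabled-only=true` query parameter that makes the check fail when JetStream is disabled; verify that against the server version you run. The helper names `healthz_url` and `is_live` are hypothetical. The status-code rule mirrors how Kubernetes HTTP probes treat any code from 200 up to, but not including, 400 as success.

```python
from urllib.parse import urlencode


def healthz_url(base: str = "http://localhost:8222",
                js_enabled_only: bool = True) -> str:
    """Build the health-check URL, optionally making it JetStream aware."""
    url = f"{base}/healthz"
    if js_enabled_only:
        # Assumption: the server supports this query parameter.
        url += "?" + urlencode({"js-enabled-only": "true"})
    return url


def is_live(status_code: int) -> bool:
    """Kubernetes HTTP probes count 200 <= code < 400 as success."""
    return 200 <= status_code < 400
```

Exposing `js_enabled_only` as a switch is one way to get the per-deployment choice asked for earlier in the thread: deployments that prefer to keep serving non-JetStream traffic can leave it off.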

caleblloyd avatar Apr 17 '23 16:04 caleblloyd