nats-server Provide Health Checks for external Systems

Feature Request

A number of external systems could utilize introspection into the readiness and liveness of the NATS server, such as K8s and others (see https://github.com/nats-io/nats-server/issues/1903). This will provide much better UX for K8s users and reduce errors on startup, resource issues, and loss of quorum.

Suggestions for Discussion

| Check | State | Suggested Endpoint |
| --- | --- | --- |
| Startup (Core NATS) | Servers are Ready | `/healthz?current-cluster-size=N` |
| Startup (JetStream) | Servers are Ready | `/healthz?current-cluster-size=N&quorum=true` |
| Readiness (Core NATS) | Accepting Client Connections | `/healthz` |
| Readiness (JetStream) | Accepting Client Connections & Is a caught up leader or follower* | `/healthz?isCandidate=false` |
| Liveness (Core NATS) | N/A (Server will stop on its own) | `/healthz` |
| Liveness (JetStream) | JetStream Subsystem is Running | `/healthz?js-enabled=true` |

*Not sure if readiness failures would prevent cluster traffic (TBD).

Liveness (JetStream) would fail if the JetStream subsystem has been shut down due to lack of resources, an unavailable PVC, etc.

The endpoints would return 200 if successful.
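To make the proposed contract concrete, here is a minimal Go sketch of an external checker that treats a 200 response as healthy and anything else as a failure. The helper names `healthzURL` and `probe`, the `localhost` host, and the 2-second timeout are assumptions for illustration; the query parameters are the ones suggested in the table above and are proposals, not a confirmed server API.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"os"
	"time"
)

// healthzURL builds a probe URL for the monitoring port.
// The query parameters passed in are the ones proposed in this issue
// (for example js-enabled=true); they are not a confirmed API.
func healthzURL(base string, params map[string]string) string {
	q := url.Values{}
	for k, v := range params {
		q.Set(k, v)
	}
	u := base + "/healthz"
	if enc := q.Encode(); enc != "" {
		u += "?" + enc
	}
	return u
}

// probe returns nil only when the endpoint answers with HTTP 200.
func probe(client *http.Client, target string) error {
	resp, err := client.Get(target)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unhealthy: %s returned %d", target, resp.StatusCode)
	}
	return nil
}

func main() {
	// Hypothetical target: a local server with monitoring on port 8222.
	client := &http.Client{Timeout: 2 * time.Second}
	target := healthzURL("http://localhost:8222", map[string]string{"js-enabled": "true"})
	if err := probe(client, target); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1) // a non-zero exit signals failure to the caller
	}
	fmt.Println("ok:", target)
}
```

The same helper could back each probe type by changing only the parameter map, which keeps the startup, readiness, and liveness checks on a single endpoint as proposed.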

Startup, Liveness, and Readiness probes would significantly help with startup and could reduce time to problem resolution (especially the Liveness probe when resources are constrained in K8s).

This may not be correct but I hope to spur discussion and am looking for community feedback in this area.

CC @nats-io/core @wallyqs @ripienaar

Nov 10 '21 20:11 ColinSullivan1

The current probes simply check whether 8222:/ responds with a 200 OK, but I think granular health statuses are always recommended.

It would be great if this is done. Can I help?

Dec 06 '21 09:12 c16a

We feel this will be resolved by #2815; additional testing will determine if that PR covers everything we need.

@c16a, thank you so much for the offer to help - much appreciated! I think we have this covered. There are plenty of issues open for contributors; don't hesitate to reach out if you find one that interests you.

Jan 26 '22 15:01 ColinSullivan1

Do we have a health check endpoint that reports cluster health? I currently have a NATS cluster with n servers in it. My understanding is that /healthz reports the health of a single NATS server and not of the entire cluster.

Dec 03 '23 12:12 Himani2000

I would suggest using the NATS CLI. You need system account access.

`nats server check meta --expect=9 --lag-critical=5 --seen-critical=1s`

Dec 03 '23 17:12 derekcollison
