
Provide Health Checks for external Systems

Open ColinSullivan1 opened this issue 2 years ago • 4 comments

Feature Request

A number of external systems, such as K8s and others (see https://github.com/nats-io/nats-server/issues/1903), could utilize introspection into the readiness and liveness of the NATS server. This would provide much better UX for K8s users and reduce errors on startup, resource issues, and loss of quorum.

Suggestions for Discussion

| Check | State | Suggested Endpoint |
| --- | --- | --- |
| Startup (Core NATS) | Servers are Ready | /healthz?current-cluster-size=N |
| Startup (JetStream) | Servers are Ready | /healthz?current-cluster-size=N&quorum=true |
| Readiness (Core NATS) | Accepting Client Connections | /healthz |
| Readiness (JetStream) | Accepting Client Connections & is a caught-up leader or follower* | /healthz?isCandidate=false |
| Liveness (Core NATS) | N/A (Server will stop on its own) | /healthz |
| Liveness (JetStream) | JetStream Subsystem is Running | /healthz?js-enabled=true |

*Not sure if readiness failures would prevent cluster traffic (TBD).

Liveness (JetStream) would fail if the JetStream subsystem has been shut down due to lack of resources, an unavailable PVC, etc.

The endpoints would return 200 if successful.
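
To make the suggestions above concrete, here is a minimal Python sketch of how an external system might build and call these probes. The query parameters are only the ones proposed in this issue (not a confirmed server API), and the helper names `proposed_params`, `probe_url`, and `is_healthy` are hypothetical.

```python
import urllib.request
from urllib.parse import urlencode


def proposed_params(check, jetstream=False, cluster_size=None):
    """Return the query parameters PROPOSED in this issue for a given probe.

    These are suggestions under discussion, not a confirmed server API.
    """
    if check == "startup":
        if cluster_size is None:
            raise ValueError("startup probes need an expected cluster size")
        params = {"current-cluster-size": cluster_size}
        if jetstream:
            params["quorum"] = "true"
        return params
    if check == "readiness":
        return {"isCandidate": "false"} if jetstream else {}
    if check == "liveness":
        return {"js-enabled": "true"} if jetstream else {}
    raise ValueError(f"unknown check: {check!r}")


def probe_url(base, check, jetstream=False, cluster_size=None):
    """Build the /healthz URL for a probe against a monitoring base URL."""
    query = urlencode(proposed_params(check, jetstream, cluster_size))
    url = base.rstrip("/") + "/healthz"
    return f"{url}?{query}" if query else url


def is_healthy(url, timeout=2.0):
    """Treat HTTP 200 as success; any other status or network error as failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, timeouts, refused connections
        return False
```

For example, `is_healthy(probe_url("http://localhost:8222", "startup", jetstream=True, cluster_size=3))` would call the suggested JetStream startup check, assuming the monitoring port is 8222.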

Startup, Liveness, and Readiness probes would significantly help in terms of startup and potentially reduce time to problem resolution (especially the Liveness probe when resources are constrained in K8s).

This may not be correct but I hope to spur discussion and am looking for community feedback in this area.

CC @nats-io/core @wallyqs @ripienaar

ColinSullivan1 · Nov 10 '21 20:11

I think that while the current probes simply check whether 8222:/ responds with a 200 OK, granular health statuses are always recommended.

It would be great if this is done. Can I help?

c16a · Dec 06 '21 09:12

We feel this will be resolved by #2815; additional testing will determine if that PR covers everything we need.

@c16a, thank you so much for the offer to help - much appreciated! I think we have this covered. There are plenty of issues open for contributors; don't hesitate to reach out if you find one that interests you.

ColinSullivan1 · Jan 26 '22 15:01

Do we have a health check endpoint that reports the health of the whole cluster? I currently have a NATS cluster with, say, n servers in it. My understanding is that /healthz reports the health of a single NATS server, not of the entire cluster.

Himani2000 · Dec 03 '23 12:12

I would suggest using the NATS CLI. You need to have system account access.

nats server check meta --expect=9 --lag-critical=5 --seen-critical=1s
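
Where running the CLI is not an option, a rough approximation is to poll /healthz on every server and combine the results. The Python sketch below assumes each server's monitoring endpoint is reachable; the function names are hypothetical, and the `expected` argument only loosely mirrors `--expect`. It does not inspect meta-group lag or last-seen times the way the command above does.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def server_ok(base_url, timeout=2.0):
    """Return True if a single server's /healthz answers with HTTP 200."""
    try:
        url = base_url.rstrip("/") + "/healthz"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # network errors, timeouts, and non-2xx responses
        return False


def cluster_report(base_urls, check=server_ok):
    """Check every server concurrently and map each URL to its result."""
    with ThreadPoolExecutor(max_workers=max(1, len(base_urls))) as pool:
        results = list(pool.map(check, base_urls))
    return dict(zip(base_urls, results))


def all_healthy(report, expected):
    """Healthy only if the expected number of servers were checked and all passed."""
    return len(report) == expected and all(report.values())
```

A caller would pass the monitoring URLs of all n servers, for example `all_healthy(cluster_report(urls), expected=len(urls))`. The `check` parameter exists so the per-server probe can be swapped out or stubbed in tests.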

derekcollison · Dec 03 '23 17:12