
Implement proper `zebrad` health check endpoint

Open · dconnolly opened this issue on Sep 08 '20 · 7 comments

We currently expose and leverage the tracing configuration endpoint as our health check when we deploy our Docker images to gcloud: we hit it with a GET and expect a 200 within certain latency parameters to consider the zebrad start process healthy. This endpoint is unauthenticated, so exposing it also leaves open to the world a pathway to change how we log tracing data.

Instead, we should have a stand-alone health check endpoint that just replies to GETs with the appropriate response code (or, you know, nothing if the node isn't really up), perhaps with some metadata about the state of the process, though right now we don't depend on anything besides a 200 within ~10 seconds.

We can literally copy the tracing endpoint impl as an Abscissa component and remove the handling for POSTs and anything more complex than 'I am up and responsive'.
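For illustration, here's a rough sketch of what that standalone listener could look like using hyper's `make_service_fn`/`service_fn` server API (the same stack the tracing endpoint sits on); the address, port, and response body are placeholders rather than a final design:

```rust
use std::net::SocketAddr;

use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Method, Request, Response, Server, StatusCode};

// Reply 200 to any GET; reject everything else. No query parsing, no shared
// state, no Abscissa locks: just "I am up and responsive".
async fn health(req: Request<Body>) -> Result<Response<Body>, hyper::Error> {
    if req.method() == Method::GET {
        Ok(Response::new(Body::from("zebrad is up\n")))
    } else {
        let mut rsp = Response::new(Body::empty());
        *rsp.status_mut() = StatusCode::METHOD_NOT_ALLOWED;
        Ok(rsp)
    }
}

#[tokio::main]
async fn main() -> Result<(), hyper::Error> {
    // Placeholder address; in zebrad this would come from the Config.
    let addr: SocketAddr = ([0, 0, 0, 0], 8080).into();

    let make_svc =
        make_service_fn(|_conn| async { Ok::<_, hyper::Error>(service_fn(health)) });

    Server::bind(&addr).serve(make_svc).await
}
```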

  • [ ] impl health check endpoint, listening on 8080
  • [ ] make it off by default? Either way, make it configurable on/off via our Config (see the config sketch after this list)
  • [ ] Update the Dockerfile to not expose the tracing endpoint via port 3000 and not enable the endpoint in our generated config
  • [ ] Update the Dockerfile to expose port 8080 and (if not enabled by default) enable the health check endpoint via our generated Config passed to the zebrad start command
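A hedged sketch of what the matching Config section could look like, using an "optional listener address, `None` means disabled" pattern; the section and field names here are made up for illustration, not the actual zebrad Config layout:

```rust
use std::net::SocketAddr;

use serde::{Deserialize, Serialize};

/// Hypothetical health check section of the zebrad config.
///
/// `None` means no listener is opened, so the endpoint is off by default and
/// we never have to pick a default IP address or choose IPv4 vs IPv6.
#[derive(Clone, Debug, Default, Deserialize, Serialize)]
#[serde(deny_unknown_fields, default)]
pub struct HealthCheckSection {
    /// The address to listen on, e.g. `0.0.0.0:8080` when deployed behind
    /// the cloud load balancer's health checker.
    pub listen_addr: Option<SocketAddr>,
}
```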

dconnolly · Sep 08 '20

Sounds good!

Just a few questions and comments:

> with perhaps some metadata about the state of the process, but right now we don't depend on anything besides a 200 in ~10 seconds.

What metadata would be useful? How much metadata should go in the health check, rather than in metrics? How should we decide?

> We can literally copy the tracing endpoint impl as an Abscissa component and remove the handling for POSTs and anything more complex than 'I am up and responsive'.

I think that endpoint has some locking bugs right now, see #1001. So we might not want to duplicate it. (Or we might find interesting bugs when we duplicate it.)

Let's also make a copy of the tracing endpoint acceptance test? We don't want to deploy code with a broken health check endpoint.
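Something like this (hypothetical) test shape would cover it, assuming the harness has already launched a zebrad with the health check endpoint enabled on a known local port:

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

/// Hypothetical acceptance test: assumes zebrad is already running with the
/// health check endpoint enabled on 127.0.0.1:8080.
#[test]
fn health_check_replies_with_200() {
    let mut stream = TcpStream::connect("127.0.0.1:8080")
        .expect("health check endpoint should be listening");

    // A bare-bones HTTP/1.1 GET; the endpoint shouldn't need anything fancier.
    stream
        .write_all(b"GET / HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n")
        .expect("request should be written");

    let mut response = String::new();
    stream
        .read_to_string(&mut response)
        .expect("response should be readable");

    assert!(
        response.starts_with("HTTP/1.1 200"),
        "unexpected health check response: {response}"
    );
}
```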

> make it off by default? either way make it configurable on/off via our Config

I think off by default is the best. Then we don't have to choose a safe default IP address, and we don't have to choose between IPv4 and IPv6.

(Also, adding a new listener that's on by default will break our parallel acceptance tests - that's fixable, but annoying.)

teor2345 · Sep 08 '20

Does the health check have to be unauthenticated, or does that just make it more convenient?

Would we want the health check to be a standalone endpoint, or part of a more general RPC endpoint? We haven't made any decisions about what kind of RPC we want to support for chain data, and there are choices about whether and to what extent we want to be compatible with Zcashd.

But it seems likely that we'll want to have Zebra-specific RPC methods that aren't related to the chain state or chain data, but are just related to the operation of the node itself. For instance, there's the health check, tracing controls (currently just the filter but in the future, potentially other things like dynamic flamegraphs, etc). Do we want to create a common "Node RPC" endpoint for this, distinct from the future "Chain RPC"?

hdevalence · Sep 08 '20

> Does the health check have to be unauthenticated, or does that just make it more convenient?
>
> Would we want the health check to be a standalone endpoint, or part of a more general RPC endpoint? We haven't made any decisions about what kind of RPC we want to support for chain data, and there are choices about whether and to what extent we want to be compatible with Zcashd.
>
> But it seems likely that we'll want to have Zebra-specific RPC methods that aren't related to the chain state or chain data, but are just related to the operation of the node itself. For instance, there's the health check, tracing controls (currently just the filter but in the future, potentially other things like dynamic flamegraphs, etc). Do we want to create a common "Node RPC" endpoint for this, distinct from the future "Chain RPC"?

For the health check that monitors whether the instances in our instance groups are healthy, it has to be unauthenticated.

For a health check like this, it really should be standalone. It should be bare-bones, not parsing queries, very simple. It's there to answer "is the application alive?", and that's about it. If we want to do more 'Node RPC' stuff, that's a different ball of wax.

It's not really part of the application; it's a meta-level check of whether the thing is operational, which is useful for integrating with other services (Da Cloud ☁️).

dconnolly · Sep 08 '20

The locking bugs for the endpoint are caused by trying to access Abscissa state. So if the health check endpoint doesn't access that state, it should be fine.

hdevalence · Sep 08 '20

> The locking bugs for the endpoint are caused by trying to access Abscissa state. So if the health check endpoint doesn't access that state, it should be fine.

Yeah, I'm fairly certain we don't need that, at least for this.

dconnolly · Sep 08 '20

Per #1030 we may want to move from Abscissa components to just "components" that are started up by the start command's main future. In that case, it wouldn't be necessary to deal with Abscissa; just producing some function that constructs a server task would work.
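In that style, the whole thing could reduce to something like this hedged sketch (names are illustrative, not Zebra's actual API), where the start command just spawns the returned future alongside its other tasks:

```rust
use std::future::Future;
use std::net::SocketAddr;

use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Response, Server};

/// Build a health check server task for the start command to spawn.
/// No Abscissa component, no shared application state.
fn health_check_task(addr: SocketAddr) -> impl Future<Output = Result<(), hyper::Error>> {
    let make_svc = make_service_fn(|_conn| async {
        Ok::<_, hyper::Error>(service_fn(|_req| async {
            // Always 200: the process answering at all is the signal.
            Ok::<_, hyper::Error>(Response::new(Body::from("OK")))
        }))
    });

    Server::bind(&addr).serve(make_svc)
}
```

The start command would then do something like `tokio::spawn(health_check_task(addr))` next to its other tasks.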

hdevalence · Sep 08 '20

down with dependency injection! seize the means of future production!

yaahc · Sep 08 '20

Let's open a specific ticket when we actually need a health check endpoint.

teor2345 · Aug 25 '22