stacks-core icon indicating copy to clipboard operation
stacks-core copied to clipboard

Feat/health check endpoint

Open pavitthrap opened this issue 4 years ago • 3 comments

Description

Fixes #1229

The get health endpoint first gets a list of all valid initial peers of the node. It then gets the max height amongst those nodes. If the block height of the node is greater than or equal to that max height, the health check endpoint returns true for the field is_healthy, and false otherwise. The endpoint also returns the fraction (block height of node/ max height of initial peers) as a percentage.

In addition to the rpc endpoint, I updated some function and variable names in http.rs for consistency (because there is a discrepancy between newly added rpc calls and older ones). I switched older methods to use pure snake case in naming (so get_poxinfo would now be get_pox_info). Let me know if this change is unwanted, I can switch it back. If it is wanted, I will make a similar change in rpc.rs as well.

To think about:

  • It's possible a node isn't keeping track of the block stats for any of its initial peers - might need to consider fallback scenario (where we consider the block heights of any node we are currently keeping the stats of)

Type of Change

  • [x] New feature
  • [ ] Bug fix
  • [ ] API reference/documentation update
  • [ ] Other

Are documentation updates required?

Added docs to existing rpc documentation.

Testing information

Added typical RPC test for this. Mocked out block stats for the peer server.

Need to test this on a live node.

pavitthrap avatar Jul 14 '21 21:07 pavitthrap

@wileyj @CharlieC3 Wondering what your thoughts are on the response type in handle_get_health (in rpc.rs). Would you prefer

  1. always getting a HttpResponseType of GetHealth, with a detailed error field
  2. always want a 400 level error if either there is an issue querying for the health or if the node is not healthy
  3. a 400 level error if there is an issue querying for health, and a GetHealth response otherwise (includes instances when query succeeds, but it is determined the node is not healthy)

pavitthrap avatar Jul 16 '21 20:07 pavitthrap

@wileyj @CharlieC3 Wondering what your thoughts are on the response type in handle_get_health (in rpc.rs). Would you prefer

  1. always getting a HttpResponseType of GetHealth, with a detailed error field
  2. always want a 400 level error if either there is an issue querying for the health or if the node is not healthy
  3. a 400 level error if there is an issue querying for health, and a GetHealth response otherwise (includes instances when query succeeds, but it is determined the node is not healthy)

Agreed with @wileyj.

We plan on modeling the health endpoint to be compatible with basic standards that health-checking tools align on, and typically that's status codes between 200 and 399 means the service is ready to be used, and anything above 400 means it's not. So while a 500 could mean the stacks-node is indefinitely stalled or it's just not synced to the stacks tip, the distinction between these two doesn't matter much because the action we'd need to take for either of them would be the same.

The plan is to have a liveliness probe and a readiness probe, each with different functions. The liveliness probe would check the health endpoint for a passing or failing HTTP status code. The liveliness probe would have to fail N times over a probing interval of M minutes before it restarts the stacks-node.

The readiness probe usually has a lower threshold before it takes action. So it may have to fail (1/2 * N) times over a probing interval of M minutes before it stops allowing traffic to reach the stacks-node, but not restart it yet until the liveliness probe triggers later.

CharlieC3 avatar Jul 16 '21 20:07 CharlieC3

@wileyj @CharlieC3 Wondering what your thoughts are on the response type in handle_get_health (in rpc.rs). Would you prefer

  1. always getting a HttpResponseType of GetHealth, with a detailed error field
  2. always want a 400 level error if either there is an issue querying for the health or if the node is not healthy
  3. a 400 level error if there is an issue querying for health, and a GetHealth response otherwise (includes instances when query succeeds, but it is determined the node is not healthy)

Agreed with @wileyj.

We plan on modeling the health endpoint to be compatible with basic standards that health-checking tools align on, and typically that's status codes between 200 and 399 means the service is ready to be used, and anything above 400 means it's not. So while a 500 could mean the stacks-node is indefinitely stalled or it's just not synced to the stacks tip, the distinction between these two doesn't matter much because the action we'd need to take for either of them would be the same.

The plan is to have a liveliness probe and a readiness probe, each with different functions. The liveliness probe would check the health endpoint for a passing or failing HTTP status code. The liveliness probe would have to fail N times over a probing interval of M minutes before it restarts the stacks-node.

The readiness probe usually has a lower threshold before it takes action. So it may have to fail (1/2 * N) times over a probing interval of M minutes before it stops allowing traffic to reach the stacks-node, but not restart it yet until the liveliness probe triggers later.

yes, 4XX can be used for sure - but in this case i would argue 5XX is more appropriate and standard for what we're expecting. That said, there is no concrete rule on what to use here @pavitthrap

wileyj avatar Jul 19 '21 14:07 wileyj

Re-implemented in #3729

pavitthrap avatar May 26 '23 18:05 pavitthrap