New health endpoint should be added that only returns success if node is fully caught-up.

Open pbennett opened this issue 2 years ago • 4 comments

Problem

The current /health endpoint returns success basically if algod is running 'at all.' However, the node isn't really capable of accepting transactions, looking up blocks or accounts, or doing much of anything until it's fully caught up.
Thus a node in the middle of catch-up (fast catchup or otherwise) shouldn't be seen as being 'ready'.

To run an Algorand node in a container (Kubernetes for instance), you typically set up http[s] liveness and readiness handlers. The readiness handler determines if traffic can be sent to the container. The existing /health is fine as a liveness handler; however, there needs to be either a query-parameter variation of /health or a new endpoint to determine if the node is actually capable of accepting real requests/transactions.

Solution

I recommend a new /ready endpoint in addition to the existing /health. It should return an HTTP status of 500 (there may be another code more appropriate, which is fine) if not ready, and 200 once ready.
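
As a rough sketch of the proposed behavior (not algod's actual implementation), a readiness handler in Go could look like the following. The `caughtUp` flag, `readyHandler`, and `probe` are hypothetical names introduced for illustration; a real node would derive the flag from its own sync state. The sketch returns 500 as suggested here, though 503 Service Unavailable is a common choice for "not ready yet".

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// caughtUp is a hypothetical flag; a real node would set it from its sync state.
var caughtUp atomic.Bool

// readyHandler answers 200 once the node is caught up and 500 before that.
func readyHandler(w http.ResponseWriter, r *http.Request) {
	if !caughtUp.Load() {
		http.Error(w, "node is still catching up", http.StatusInternalServerError)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// probe calls the handler in-process and returns the HTTP status code,
// which is all a readiness probe looks at.
func probe() int {
	rec := httptest.NewRecorder()
	readyHandler(rec, httptest.NewRequest(http.MethodGet, "/ready", nil))
	return rec.Code
}

func main() {
	fmt.Println(probe()) // 500 while catching up
	caughtUp.Store(true)
	fmt.Println(probe()) // 200 once caught up
}
```

In a real server the handler would be registered next to the existing health route, for example with `http.HandleFunc("/ready", readyHandler)`, so a readiness probe needs nothing more than an HTTP GET.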

Adding this removes the need for the script hacks typically used to work around this, which include shipping additional binaries with the Algorand node (like shells, or programs such as curl that could be used as an attack vector). Integrators can simply define their readiness handler as pointing to the /ready endpoint and it will just work. Nodes can be configured to catch up on startup, and no traffic will be sent to them until they're fully caught up.

pbennett avatar Jul 06 '22 02:07 pbennett

Could you use something like the v2/status endpoint and jq to check that catchup-time equals 0?
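
For reference, the same check can be expressed in code rather than with jq. The sketch below assumes the v2/status body is JSON with an integer catchup-time field, as the question implies; `nodeStatus` and `isCaughtUp` are hypothetical names, and the sample bodies are made up for illustration rather than captured from a node.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// nodeStatus holds the single field of the status response used here.
// A pointer is used so that a missing field can be told apart from 0.
type nodeStatus struct {
	CatchupTime *int64 `json:"catchup-time"`
}

// isCaughtUp reports whether a status JSON body shows catchup-time == 0.
func isCaughtUp(body []byte) (bool, error) {
	var st nodeStatus
	if err := json.Unmarshal(body, &st); err != nil {
		return false, err
	}
	if st.CatchupTime == nil {
		return false, errors.New("catchup-time missing from status response")
	}
	return *st.CatchupTime == 0, nil
}

func main() {
	// Illustrative bodies only: one caught up, one still catching up.
	samples := []string{`{"catchup-time":0}`, `{"catchup-time":1500000000}`}
	for _, body := range samples {
		ok, err := isCaughtUp([]byte(body))
		fmt.Println(ok, err)
	}
}
```

Treating a missing field as an error rather than as 0 keeps an unexpected response from being read as "caught up".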

winder avatar Jul 28 '22 14:07 winder

See my comments on why I would never want to do this. Things like curl, etc. - or even a shell - shouldn't be part of any production container. They provide attack vectors, whether via exfiltration or, most importantly, the ability to import foreign code trivially and execute it.

pbennett avatar Jul 28 '22 20:07 pbennett

It seems like HTTP probes in Kubernetes are standardized enough that it would be nice to directly support them.

https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes

jannotti avatar Oct 11 '22 08:10 jannotti

Correct, readiness and liveness handlers are critical components in k8s deployments. The startup probe 'could' be used, but not all implementations honor it with regard to the readiness handler (for example GKE and its newer native container load-balancing, which only honors readiness handlers for ingress). This would be a super simple addition to algod, but also extremely useful.

pbennett avatar Oct 11 '22 14:10 pbennett

I have actually run into this issue as well. From my experience, cloud native load balancers (outside of k8s) check for readiness through a very simple system. You define an endpoint on the VM and the load balancer checks for either a 200 (ish) response or an error/no response. I would appreciate this addition as well!

WesleyMiller1998 avatar Oct 19 '22 06:10 WesleyMiller1998

What about a database schema change? They happen very rarely but, when they do, the backend isn't up, so one cannot be aware of its state.

It has happened in the past: some machines took up to half an hour to complete the process, and one doesn't know whether the process is hung.

mxmauro avatar Dec 01 '22 12:12 mxmauro

Exactly. Same issue - catch-up or db upgrades. 👍

pbennett avatar Dec 01 '22 12:12 pbennett

Excellent!! Thanks.

mxmauro avatar Dec 01 '22 12:12 mxmauro