go-algorand New health endpoint should be added that only returns success if node is fully caught-up.

Problem

The current /health endpoint returns success as long as algod is running at all. However, the node can't accept transactions, look up blocks or accounts, or do much of anything until it's fully caught up.
Thus a node in the middle of catch-up (fast catchup or otherwise) shouldn't be seen as 'ready'.

To run an Algorand node in a container (Kubernetes, for instance), you typically set up http[s] liveness and readiness handlers. The readiness handler determines whether traffic can be sent to the container. The existing /health is fine as a liveness handler; however, there needs to be either a query-parameter variation of /health or a new endpoint to determine whether the node is actually capable of accepting real requests/transactions.

Solution

I recommend a new /ready endpoint in addition to the existing /health. It should return an HTTP status of 500 (another code may be more appropriate, which is fine) if not ready, and 200 once ready.

Adding this removes the need for the script hacks typically used to work around the problem, which include shipping additional binaries alongside the Algorand node (shells, or programs like curl that could be used as an attack vector). Integrators can simply point their readiness handler at the /ready endpoint and it will just work. Nodes can be configured to catch up on startup, and no traffic will be sent to them until they're fully caught up.

Jul 06 '22 02:07 pbennett

Could you use something like the v2/status endpoint and jq to check that catchup-time equals 0?

Jul 28 '22 14:07 winder

See my comments on why I would never want to do this. Things like curl, etc. - or even a shell - shouldn't be part of any production container. They provide attack vectors, whether via exfiltration or, most importantly, the ability to import foreign code trivially and execute it.

Jul 28 '22 20:07 pbennett

It seems like http probes in kubernetes are standardized enough that it would be nice to directly support them.

https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes

Oct 11 '22 08:10 jannotti

Correct, readiness and liveness handlers are critical components in k8s deployments. The startup probe could be used, but not all implementations honor it with regard to readiness (GKE's newer container-native load balancing, for example, only honors readiness handlers for ingress). This would be a super simple addition to algod, but also extremely useful.

Oct 11 '22 14:10 pbennett

I have actually run into this issue as well. In my experience, cloud-native load balancers (outside of k8s) check for readiness through a very simple system: you define an endpoint on the VM, and the load balancer checks for either a 200 (ish) response or an error/no response. I would appreciate this addition as well!

Oct 19 '22 06:10 WesleyMiller1998

What about a database schema change? They happen very rarely, but when they do, the backend isn't up, so one cannot know the state of the node.

It has happened in the past: some machines took up to half an hour to complete the process, and there was no way to tell whether the process was hung.

Dec 01 '22 12:12 mxmauro

Exactly. Same issue - catch-up or db upgrades. 👍

Dec 01 '22 12:12 pbennett

Excellent!! Thanks.

Dec 01 '22 12:12 mxmauro
