go-algorand
New health endpoint should be added that only returns success if node is fully caught-up.
Problem
The current /health endpoint returns success basically if algod is running at all.
However, the node really isn't capable of accepting transactions, looking up blocks or accounts, or doing much of anything until it's fully caught up.
Thus a node in the middle of catch-up (fast catchup or otherwise) shouldn't be seen as being 'ready'.
To run an Algorand node in a container (Kubernetes, for instance), you typically set up http[s] liveness and readiness handlers. The readiness handler determines whether traffic can be sent to the container. The existing /health is fine as a liveness handler; however, there needs to be either a query-parameter alteration to /health or a new endpoint to determine whether the node is actually capable of accepting real requests/transactions.
Solution
I recommend a new endpoint, /ready, in addition to the existing /health. It should return an HTTP status of 500 (there may be another, more appropriate code, which is fine) if the node is not ready, and 200 once it is.
Adding this prevents the script hacks typically needed to work around this, including shipping additional binaries alongside the Algorand node (shells, or programs that could be used as attack vectors [curl, etc.]). Integrators can simply point their readiness handler at the /ready endpoint and it will just work. Nodes can be configured to catch up on startup, and no traffic will be sent to them until they're fully caught up.
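A minimal sketch of what such a handler could look like, not algod's actual code: `statusProvider` and `CatchupTime` are assumed names standing in for whatever the node already tracks internally (the same information /v2/status reports as catchup-time).

```go
package ready

import (
	"net/http"
	"time"
)

// statusProvider is a hypothetical interface over the node's status;
// CatchupTime is assumed to be zero once the node is fully caught up.
type statusProvider interface {
	CatchupTime() time.Duration
}

// ReadyHandler returns 200 when the node is caught up and 500 otherwise
// (the issue notes a different status code might be more appropriate).
func ReadyHandler(node statusProvider) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if node.CatchupTime() != 0 {
			http.Error(w, "node is catching up", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}
```

In algod this would hang off the existing REST router next to /health, so no extra binaries or shells are needed in the container.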
Could you use something like the /v2/status endpoint and jq to check for catchup-time equals 0?
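For concreteness, here is roughly what that check amounts to, written as a small Go probe instead of curl+jq. The /v2/status path, the catchup-time field, and the X-Algo-API-Token header follow the public algod REST API; the local address and the ALGOD_TOKEN environment variable are assumptions for the sketch.

```go
package main

import (
	"encoding/json"
	"net/http"
	"os"
)

func main() {
	// Assumed local algod address; the token normally comes from the node's algod.token file.
	req, _ := http.NewRequest("GET", "http://localhost:8080/v2/status", nil)
	req.Header.Set("X-Algo-API-Token", os.Getenv("ALGOD_TOKEN"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		os.Exit(1) // node unreachable: not ready
	}
	defer resp.Body.Close()

	var status struct {
		CatchupTime int64 `json:"catchup-time"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&status); err != nil || status.CatchupTime != 0 {
		os.Exit(1) // still catching up: not ready
	}
	// catchup-time == 0: caught up, exit 0 so the probe succeeds
}
```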
See my comments on why I would never want to do this. Things like curl, etc. - or even a shell - shouldn't be part of any production container. They provide attack vectors, whether via exfiltration or, most importantly, the ability to trivially import and execute foreign code.
It seems like http probes in kubernetes are standardized enough that it would be nice to directly support them.
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes
Correct, readiness and liveness handlers are critical components in k8s deployments. The startup probe 'could' be used, but not all implementations honor it with regard to the readiness handler (for example, GKE's newer native container load-balancing only honors readiness handlers for ingress). This would be a super simple addition to algod, but also extremely useful.
I have actually run into this issue as well. From my experience, cloud native load balancers (outside of k8s) check for readiness through a very simple system: you define an endpoint on the VM, and the load balancer checks for either a 200(-ish) response or an error/no response. I would appreciate this addition as well!
What about a database schema change? They are rare, but when they happen the backend isn't up, so there's no way to know the node's state.
It has happened in the past: some machines took up to half an hour to complete the process, and there was no way to tell whether the process was hung.
Exactly. Same issue - catch-up or db upgrades. 👍
Excellent!! Thanks.