Need simple API call to verify if node is "Active Synchronization" or "Node is synced"

Open vinnielima opened this issue 2 years ago • 9 comments

In summary, need an API call such as "/healthcheck" that provides a simple text response that reflects the status of a node synchronization with the blockchain.

I operate over 14 nodes globally for GetBlok.io. It is hard to accept that, at this maturity of the reference client, it still consistently runs into synchronization problems with the blockchain. When operating a global mining pool, we cannot afford to have a node fall out of sync.

Typically, this type of failure in the infrastructure is handled through health probes. The problem is that the Ergo reference client node does not facilitate such checks by industry-standard load balancers.

Most applications offer an API endpoint that simply returns text indicating such a condition (OK, UNHEALTHY, etc.). This is because most GLB/ELBs do not provide complex mechanisms for calculations (such as current height vs. blockchain height). They simply provide the ability to check for HTTP return codes (404, 200, etc.), as well as reasonable parsing of the returned body of the GET request.

As many know, you cannot assume your Ergo node is healthy simply because it is "online".

Thus, the request here is simple: provide an unauthenticated API endpoint that reports whether a node is synchronized or actively synchronizing. This will allow many pool operators to perform such a health check and pull a node out of a load-balanced back end.
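
Until the node offers such an endpoint, one possible workaround is a small sidecar that the load balancer probes instead. The following is only a sketch and is not part of the Ergo node: it assumes the node's REST API is reachable at http://127.0.0.1:9053 (adjust to your setup) and that /info returns JSON containing fullHeight and maxPeerHeight. The status strings, the one-block tolerance, the sidecar port and the names health_response, HealthHandler and serve are all made up for illustration.

```python
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

NODE_INFO_URL = "http://127.0.0.1:9053/info"  # assumed node address, not from this thread
MAX_LAG_BLOCKS = 1  # hypothetical tolerance before the node counts as behind


def health_response(info):
    """Map a parsed /info body to (HTTP status code, plain-text body)."""
    full = info.get("fullHeight")
    peer = info.get("maxPeerHeight")
    if full is None or peer is None:
        # Heights not available yet (node still loading or no peers).
        return 503, "UNKNOWN"
    if peer - full > MAX_LAG_BLOCKS:
        return 503, "SYNCHRONIZING"
    return 200, "SYNCHRONIZED"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthcheck":
            self.send_error(404)
            return
        try:
            with urllib.request.urlopen(NODE_INFO_URL, timeout=5) as resp:
                status, body = health_response(json.load(resp))
        except (OSError, ValueError):
            # Node unreachable or returned something that is not JSON.
            status, body = 503, "UNREACHABLE"
        payload = body.encode("ascii")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Run the sidecar until interrupted (port is an arbitrary choice)."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling serve() starts the sidecar; a load balancer could then probe GET /healthcheck and act on either the status code (200 vs. 503) or the body text. Note that maxPeerHeight is only what peers report, so this inherits any weakness of that value.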

vinnielima avatar Jun 27 '22 00:06 vinnielima

Recently maxPeerHeight was introduced in /info. It contains the max height of the node's neighbours. You can use it, and the node can use it, to conclude that the node is out of sync. However, the reported height can be faked (by generating low-work headers), so now we need to somehow take the work spent on headers into account. Then the syncing protocol can be improved a lot.

This PR may help in some cases https://github.com/ergoplatform/ergo/pull/1750 , however, this is just a temporary work-around. We started on reworking mempool (to introduce incoming txs throttling based on cumulative validation cost) as well.

kushti avatar Jun 27 '22 15:06 kushti

While what you are suggesting are ways to address nodes running out of sync, they do not help with telemetry and synthetic monitoring.

Until node synchronization can remain stable (and, in my opinion, even after that), the ask is for a simple API endpoint that presents the node's health status as a simple result: Synchronized, Synchronizing, etc. (whatever the wording, it needs to be simple text). Thus I proposed "/healthcheck", which should return the state of synchronization in simple text.

As I mentioned above, most load balancers do not allow for calculations, simple or complex, in which maxPeerHeight could be taken into account. They can only run a simple text regex against the result of a GET call.

vinnielima avatar Jun 27 '22 17:06 vinnielima

@vinnielima could we add a new field to /peers/status? Or is this too complex for load balancers as well?

kushti avatar Jun 27 '22 21:06 kushti

GET /info and if /lH.*(\d+),[\s\S]*P.*\1,/gm matches, fullHeight is equal to maxPeerHeight, which would mean the node is synced
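
To see how this pattern behaves, here is a small Python sketch that runs it against made-up, trimmed-down /info bodies (the real response has more fields, and their order is an assumption here). One thing to be aware of: because `.*` is greedy, the capture group ends up holding only the last digit before the comma, so heights that merely share a final digit also match. STRICT is a hypothetical tighter variant that anchors on the quoted field names; it assumes fullHeight appears before maxPeerHeight in the body and that the load balancer's regex engine supports backreferences.

```python
import re

# The pattern from the comment above. JavaScript's /g and /m flags are dropped
# because they do not affect whether a match exists (no ^ or $ in the pattern).
LOOSE = re.compile(r'lH.*(\d+),[\s\S]*P.*\1,')

# Hypothetical stricter variant: captures the whole fullHeight number and
# requires maxPeerHeight to be exactly the same number.
STRICT = re.compile(
    r'"fullHeight"\s*:\s*(\d+)\s*,[\s\S]*"maxPeerHeight"\s*:\s*\1\s*[,}]'
)


def info_body(full_height, max_peer_height):
    """Build a made-up, trimmed-down /info body for illustration only."""
    return (
        '{\n'
        f'  "fullHeight" : {full_height},\n'
        f'  "maxPeerHeight" : {max_peer_height},\n'
        '  "isMining" : false\n'
        '}'
    )


for full, peer in [(100, 100), (101, 200), (100, 200), ("null", "null")]:
    body = info_body(full, peer)
    print(full, peer,
          "loose:", bool(LOOSE.search(body)),
          "strict:", bool(STRICT.search(body)))
```

In this sketch the (100, 200) case matches the loose pattern even though the heights differ, while the strict pattern only matches when both numbers are identical.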

jellymlg avatar Jun 27 '22 21:06 jellymlg

@jellymlg Well, that's some kung fu regex. Here is the above regex run against a properly sync'ed node:

[image: regex result against a sync'ed node]

What is unclear is what you are expecting to match (or not match, in this case) for an unsync'ed node. Would some of those key/value pairs be missing? Certain load balancers can perform regex matching; it is a little more processing than I'd like, but I'll take what I can get.

@kushti to your question: querying /peers/status returns values that would have to be inferred/interpreted by the load balancer, which is not something it can really do:

{ "lastIncomingMessage": 1656365850283, "currentSystemTime": 1656365854007 }

The above would require the LB to have some understanding of how to interpret lastIncomingMessage relative to the current time.
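
For illustration, this is the kind of interpretation a load balancer cannot do on its own, written as a Python sketch. It assumes both fields are Unix timestamps in milliseconds (which the sample values suggest); the two-minute threshold and the function names are arbitrary choices, not something the node defines.

```python
def ms_since_last_message(status):
    """Milliseconds between the node's clock and its last incoming message."""
    return status["currentSystemTime"] - status["lastIncomingMessage"]


def is_receiving(status, max_silence_ms=120_000):
    """True if a peer message arrived within max_silence_ms (assumed threshold)."""
    return ms_since_last_message(status) <= max_silence_ms


sample = {"lastIncomingMessage": 1656365850283, "currentSystemTime": 1656365854007}
print(ms_since_last_message(sample))  # 3724
print(is_receiving(sample))           # True
```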

vinnielima avatar Jun 27 '22 21:06 vinnielima

The matching is based on the fact that the number after fullHeight is the same as the number after maxPeerHeight. And these key-value pairs are never missing; they are null if the node is not loaded or not connected to any peers, or different if the node is unsynced.

jellymlg avatar Jun 27 '22 22:06 jellymlg

Fantastic, I will test and report back

vinnielima avatar Jun 29 '22 13:06 vinnielima

Could a flag not be introduced? The node already has the details on the /panel when it's 100% synced.

I'd imagine it would be useful if wallets etc. start to call nodes directly. It would be handy for them to be able to filter out, from their list of node candidates, the ones that are not fully synced on blocks, via some true/false value.

HelixEvo avatar Aug 15 '22 08:08 HelixEvo

IMO, it would be nice to have a /healthy endpoint on the node that compares fullHeight and maxPeerHeight and fails if they differ by MORE THAN 1 block, or if they still differ after a few checks within something like a 30-second timeout. (They differ briefly while the node is waiting to get a new block, so we don't want to generate alerts all the time.) Another option is for the node to check its own health periodically, every 15 seconds for example; once 2 consecutive comparisons, 15 seconds apart, give the same result, some variable "isHealthy" is set to true or false, and the /healthy endpoint just returns that variable. I'm not sure this is the best way to implement it, or whether it would add load to the node, but it seems like a very light operation anyway.
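
One possible reading of that second idea, as a Python sketch rather than anything the node actually implements: the caller samples fullHeight and maxPeerHeight on a timer (every 15 seconds in the proposal) and the flag only flips after two consecutive samples agree. The class name, parameter names and defaults are hypothetical.

```python
class SyncTracker:
    """Debounced health flag driven by periodic height samples."""

    def __init__(self, tolerance_blocks=1, required_consecutive=2):
        self.tolerance_blocks = tolerance_blocks
        self.required_consecutive = required_consecutive
        self.is_healthy = False   # what a /healthy endpoint would return
        self._last_verdict = None
        self._streak = 0

    def record(self, full_height, max_peer_height):
        """Feed one sample; returns the (possibly updated) health flag."""
        verdict = (
            full_height is not None
            and max_peer_height is not None
            and max_peer_height - full_height <= self.tolerance_blocks
        )
        # Count how many samples in a row produced the same verdict.
        if verdict == self._last_verdict:
            self._streak += 1
        else:
            self._streak = 1
        self._last_verdict = verdict
        # Only change the flag once the verdict has been stable long enough.
        if self._streak >= self.required_consecutive:
            self.is_healthy = verdict
        return self.is_healthy


tracker = SyncTracker()
print(tracker.record(100, 100))  # False: first sample, not yet confirmed
print(tracker.record(100, 100))  # True: two matching samples in a row
print(tracker.record(100, 105))  # True: a single lagging sample is ignored
print(tracker.record(100, 106))  # False: lag confirmed by a second sample
```

The debounce is what keeps a single late block from triggering an alert, at the cost of reacting one sampling interval later.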

I'm assuming that if the node loses connectivity to all peers, maxPeerHeight is not going to stay static at the same height as the node? Because if it does, we will also need some sort of connectivity check against other nodes, like the { "lastIncomingMessage": 1656365850283, "currentSystemTime": 1656365854007 } comparison for example.

As per @jellymlg, "they are null if the node is not loaded or not connected to any peers". I'm going to assume that means maxPeerHeight becomes null as soon as the node is no longer connected to at least 1 peer, and that null does not appear only at node startup?

Then there is the fact that /healthy could be misleading as well. Is a node that is syncing from genesis, but not yet at the tip, considered an "unhealthy" node? Maybe the endpoint should just be called /synced?

reqlez avatar Feb 19 '23 19:02 reqlez