graph-node
Make `graph-node` tolerate chains not being available during startup
Right now, when graph-node starts, it runs some checks against each RPC endpoint; if those checks fail or time out, graph-node marks the chain as not working and stops using it. That also makes every subgraph that uses that chain fail at startup. If the problem with the endpoint is transient, the only way to get graph-node to use it again is to restart it (at the risk that some other chain now has a transient issue).
The code needs to be changed so that graph-node is much more tolerant of such transient issues and automatically retries a chain that caused trouble during startup. As part of solving the issue, we should also produce documentation that describes what is expected of an endpoint before we will use it, and a graphman command that allows checking any given endpoint by going through its startup sequence. Additionally, there needs to be some way to figure out which of the configured endpoints graph-node considers usable/not usable at any given point in time.
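To make the proposal concrete, here is a minimal sketch of what "retry instead of giving up, and remember which endpoints are usable" could look like. It is not graph-node code: `EndpointStatus`, `EndpointRegistry`, `backoff`, and the closure standing in for the startup checks are all hypothetical names, and the backoff schedule (1s doubling up to 64s) is just an assumption for illustration.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical health state for one configured endpoint.
#[derive(Debug, Clone, PartialEq)]
pub enum EndpointStatus {
    Usable,
    Unusable { failed_attempts: u32, last_error: String },
}

/// Hypothetical registry answering "which endpoints are usable right now?".
#[derive(Default)]
pub struct EndpointRegistry {
    statuses: HashMap<String, EndpointStatus>,
}

impl EndpointRegistry {
    /// Runs the startup checks for `name` and records the outcome.
    /// A failure is remembered so the endpoint can be retried later,
    /// rather than being dropped for the lifetime of the process.
    /// Returns `true` if the endpoint is usable after this check.
    pub fn check<F>(&mut self, name: &str, startup_check: F) -> bool
    where
        F: FnOnce() -> Result<(), String>,
    {
        let failed_before = match self.statuses.get(name) {
            Some(EndpointStatus::Unusable { failed_attempts, .. }) => *failed_attempts,
            _ => 0,
        };
        let status = match startup_check() {
            Ok(()) => EndpointStatus::Usable,
            Err(last_error) => EndpointStatus::Unusable {
                failed_attempts: failed_before + 1,
                last_error,
            },
        };
        let usable = status == EndpointStatus::Usable;
        self.statuses.insert(name.to_string(), status);
        usable
    }

    /// How long to wait before re-running the checks for `name`.
    /// `None` means no retry is pending (usable, or never checked).
    pub fn retry_delay(&self, name: &str) -> Option<Duration> {
        match self.statuses.get(name) {
            Some(EndpointStatus::Unusable { failed_attempts, .. }) => {
                Some(backoff(*failed_attempts))
            }
            _ => None,
        }
    }

    /// Names of endpoints currently considered unusable, sorted.
    pub fn unusable(&self) -> Vec<&str> {
        let mut names: Vec<&str> = self
            .statuses
            .iter()
            .filter(|(_, s)| **s != EndpointStatus::Usable)
            .map(|(n, _)| n.as_str())
            .collect();
        names.sort();
        names
    }
}

/// Assumed schedule: 1s after the first failure, doubling, capped at 64s.
pub fn backoff(failed_attempts: u32) -> Duration {
    let exp = failed_attempts.saturating_sub(1).min(6);
    Duration::from_secs(1u64 << exp)
}
```

A background task could call `check` again once `retry_delay` has elapsed, and something like `unusable()` is the kind of information a graphman command or a metric could surface.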
Sounds like a great improvement.
Please also expose this knowledge via Prometheus monitoring wherever possible/reasonable.
I just filed #4115 which seems possibly related to this. Does graph-node mark a chain as not working if a firehose provider goes down after successful startup?
In addition to tolerating chains that are unavailable during startup, it would be amazing to tolerate chains which go down while graph-node is running. Even better, I think, would be tolerating individual RPC/Firehose providers being unavailable instead of marking a chain as dead when a single provider is down.
Looks like this issue has been open for 6 months with no activity. Is it still relevant? If not, please remember to close it.
#4754 should help with firehose providers, by allowing them to retry for 30 secs before giving up.
@leoyvens does #4754 also help with https://github.com/graphprotocol/graph-node/issues/4323?
EDIT: and if so, can/should we make the 30 seconds configurable?
@paymog I think it will help, but https://github.com/graphprotocol/graph-node/pull/4778 is more targeted.
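On the question of making the 30 seconds configurable: a rough sketch of a retry loop with a caller-supplied window, for illustration only. This is not how #4754 is implemented; `retry_for` and its parameters are hypothetical, and the loop uses blocking `std::thread::sleep` purely to stay dependency-free.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Keeps calling `attempt` until it succeeds or `window` has elapsed.
/// `window` is the configurable "retry for N seconds before giving up"
/// value; `pause` is the wait between attempts. On giving up, the error
/// from the last attempt is returned.
pub fn retry_for<T, E, F>(window: Duration, pause: Duration, mut attempt: F) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
{
    let deadline = Instant::now() + window;
    loop {
        match attempt() {
            Ok(value) => return Ok(value),
            // Window exhausted: surface the most recent error.
            Err(e) if Instant::now() >= deadline => return Err(e),
            // Still inside the window: wait a bit and try again.
            Err(_) => sleep(pause),
        }
    }
}
```

With this shape, the current behaviour would correspond to something like `retry_for(Duration::from_secs(30), ...)`, and the window could instead be read from configuration.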