zetaclient stops connecting to external chain RPCs after receiving HTTP error codes.
Describe the Bug When zetaclientd gets an HTTP error back from the BTC RPC it stops working and never retries until zetaclientd is restarted.
To Reproduce Would need to replicate the the 409 or other HTTP errors codes from the BTC RPC. It isn't clear if this is limited to BTC or an issue for the RPC connectivity for all networks.
Expected Behavior Ideally all attempts to connect to RPCs for external networks should have a backoff timer and continue to retry even if they receive errors from the external RPC server.
Screenshots If applicable, add screenshots to help explain your problem.
You can see from the logs one one validator was falling behind on BTC blocks
Both of the places were FetchUTXOs is called do not exit if error. So if this is really happening then the log messages are probably unrelated.
What we really need in cases like this is a goroutine dump via curl -o goroutine.dump http://localhost:6061/debug/pprof/goroutine?debug=2
My best guess is that one of the RPCs just hung. The bitcoin rpc client doesn't take contexts or have any configurable timeouts on calls.
https://github.com/zeta-chain/node/blob/80ca9214bf531668b0b18199f0743fa81957f3ca/zetaclient/chains/bitcoin/observer/observer.go#L308-L319
https://github.com/zeta-chain/node/blob/80ca9214bf531668b0b18199f0743fa81957f3ca/zetaclient/chains/bitcoin/signer/signer.go#L206-L213
Seems we only kept 7 days of logs in Datadog and above log prints are not tracked any more. After searching above log prints in code, the initial error was on Bitcoin RPC GetBlockCount (error getting block height), and then it failed FetchUTXOs. When FetchUTXOs fails, zetaclient won't be able to get balance and outbound will not work properly.
@gartnera As you said, the Bitcoin RPCs are designed to NOT take a context, so they should be able to resume/reconnect by themselves. The root cause is not looking clear with above logs.
The whole process was:
- We have a
watchRPCStatusgoroutine that NEVER stops for Bitcoin chain. - Every mininute, this goroutine either print
RPC latency is OKor an error message. - The last heart beat of
validator1-us-east-1-mainnetwas on16:20:43as shown in below screenshot. - The heart beat resumed after restarting zetaclient
18:58:33.
Here are my guesses after investigation into logs.
- The
watchRPCStatusgo-routine got stuck on this call blockTime, err := rpc.CheckRPCStatus(ob.btcClient, tssAddress) for 2.5 hours. - The
ObserveInboundgo-routine also got stuck, for similar reason, on this line err := ob.ObserveInbound(ctx)
The reason is that both go-routines either NEVER work silently, they print message on every iteration. To be specific,
watchRPCStatus prints RPC latency is OK or error;
ObserveInbound prints observeInboundBTC: block XXX has YYY txs
We see no logs like above after last heart beat.
On which lines the above two go-routines got stuck? A go routine stops execution only if it is blocked by a lock or IO. There is no lock used in above go-routines (no crash as well) so the IO is only what we can suspect.
In our case, the RPC status check got stuck on one of the following RPC calls in function CheckRPCStatus
GetBlockCount, GetBlockHash, GetBlockHeader, ListUnspentMinMaxAddresses.
The inbound go routine got stuck on one of the two lines (not sure which one):
currentBlock, err := ob.btcClient.GetBlockCount()at line 76 that invokesGetBlockCount.res, err := ob.GetBlockByNumberCached(int64(blockNumber))at line 108, which invokesGetBlockHash, GetBlockHeader, GetBlockVerboseTxinternally.
Restarting the program solved the RPC errors was a interesting hint. The 409 error indicates a request conflict with the current state of the target resource and was resolved by a restart?
The link to Grafana logs
What's a good way to avoid this in the future? We need to ensure zetaclient can handle unusual responses from RPCs without going offline.
Something like a RPC heartbeat fails to work after 5 minutes restart the go-routine for that network? If we can't handle the error directly when it happens.
This feels to restarting the zetaclient reset/released a process-level resource (don't know what was it), therefor solved the RPC hung problem. The above issue was like (for instance):
zetaclient go routine calls GetBlockCount and the call just got stuck right on the line and never returned. So the program execution was unexpectedly hung (the Bitcoin RPCs were designed return, without a context). A remedy to this would be external and operational in stead of self diagnosis inside zetaclient.
We would leave this issue under monitoring for a while before taking effective actions so we can collect and identify some patterns/clues of the errors.
What's a good way to avoid this in the future? We need to ensure zetaclient can handle unusual responses from RPCs without going offline.
- Completely reimplement the bitcoin rpc such that it takes contexts and properly returns on timeout/context cancelation.
- Add an in process watchdog that must be bumped every N seconds. If it is not bumped the process would automatically restart.
Medium to high effort
Completely reimplement the Bitcoin RPC such that it takes contexts and properly returns on timeout/context cancellation.
I agree, this is the only reliable way of solving this.
We need to write a wrapper around github.com/btcsuite/btcd/btcjson that uses the same DTOs but has proper HTTP & ctx support. Only used methods should be implemented, no need to code all supported RPC methods.
We can also leverage this moment to bring proper retries, logs, and Prometheus metrics.
UPD: Inspected btcjson, it shouldn't be hard to implement our own client.
btcjson.MarshalCmd(...) marshals types into {"jsonrpc":"1.0","method":"...","params":["..."],"id":1} rpc calls
@gartnera I agree with the short/medium term to setup watchdog to automatically restart. Let's monitor how frequently the issue would happen moving forward and find some patterns. If necessary, re-implementing (or fork and add context) Bitcoin RPC is always not a bad idea if RPC calls hang a lot.
The changes recommned have already been implemented https://github.com/zeta-chain/zeta-node/blob/ab691b14cbe37a34f5dee66a3f690ad10e14d2e3/zetaclient/chains/bitcoin/client/client.go#L116-L137
We do not automatcially retry all calls if there is an hhtp error , releated to the PRC though . Monitoreing that the RPC is healthy is part of a separate healthcheck routine which can be monitered via metrics https://github.com/zeta-chain/zeta-node/blob/ab691b14cbe37a34f5dee66a3f690ad10e14d2e3/zetaclient/chains/bitcoin/observer/observer.go#L291-L304
in my opinion the problem stated here has been mititgated , but the approach is different from orginally proposed , as long as we ar monitering the healthcheck metrics of the rpcs we should be okay handling errors