
zetaclient stops connecting to external chain RPCs after receiving HTTP error codes.

Open CharlieMc0 opened this issue 11 months ago • 8 comments

Describe the Bug
When zetaclientd gets an HTTP error back from the BTC RPC, it stops working and never retries until zetaclientd is restarted.

To Reproduce
Would need to replicate the 409 or other HTTP error codes from the BTC RPC. It isn't clear whether this is limited to BTC or is an issue with RPC connectivity for all networks.

Expected Behavior
Ideally, all attempts to connect to RPCs for external networks should have a backoff timer and continue to retry even if they receive errors from the external RPC server.
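
For illustration, a retry wrapper with exponential backoff could look roughly like the sketch below. This is not existing zetaclient code; the package and function names (rpcretry, retryWithBackoff) and parameters are hypothetical.

```go
// Hypothetical sketch: a generic retry helper with exponential backoff.
// Callers would wrap each external RPC call in this instead of giving up
// after the first HTTP error.
package rpcretry

import (
	"context"
	"fmt"
	"time"
)

// retryWithBackoff keeps calling fn until it succeeds or ctx is cancelled,
// doubling the wait between attempts up to maxBackoff.
func retryWithBackoff(ctx context.Context, fn func() error, initial, maxBackoff time.Duration) error {
	backoff := initial
	for {
		err := fn()
		if err == nil {
			return nil
		}

		select {
		case <-ctx.Done():
			return fmt.Errorf("giving up: %w (last error: %v)", ctx.Err(), err)
		case <-time.After(backoff):
		}

		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```

In this sketch the caller's context deadline is what ultimately bounds the retries; a real version would also log each failed attempt.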

Screenshots

Screenshot 2024-12-29 at 6 07 06 PM

You can see from the logs that one validator was falling behind on BTC blocks.
Screenshot 2024-12-29 at 6 07 32 PM
Screenshot 2024-12-29 at 6 11 04 PM

CharlieMc0 avatar Dec 30 '24 01:12 CharlieMc0

Both of the places where FetchUTXOs is called do not exit on error. So if this is really happening, the log messages are probably unrelated.

What we really need in cases like this is a goroutine dump via curl -o goroutine.dump http://localhost:6061/debug/pprof/goroutine?debug=2

My best guess is that one of the RPCs just hung. The bitcoin rpc client doesn't take contexts or have any configurable timeouts on calls.

https://github.com/zeta-chain/node/blob/80ca9214bf531668b0b18199f0743fa81957f3ca/zetaclient/chains/bitcoin/observer/observer.go#L308-L319

https://github.com/zeta-chain/node/blob/80ca9214bf531668b0b18199f0743fa81957f3ca/zetaclient/chains/bitcoin/signer/signer.go#L206-L213
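
As a stopgap while the client has no timeouts, each blocking call could be raced against a deadline. This is only a sketch (the package and function names are illustrative), and note the trade-off: on timeout the spawned goroutine may still be stuck inside the call; we only unblock the caller.

```go
// Hypothetical sketch: bound a blocking, context-unaware RPC call with a
// deadline. The anonymous interface matches the client's
// GetBlockCount() (int64, error) method so any such client can be passed in.
package btcrpcutil

import (
	"fmt"
	"time"
)

func getBlockCountWithTimeout(client interface{ GetBlockCount() (int64, error) }, timeout time.Duration) (int64, error) {
	type result struct {
		height int64
		err    error
	}
	ch := make(chan result, 1) // buffered so the goroutine can always send and exit

	go func() {
		h, err := client.GetBlockCount()
		ch <- result{h, err}
	}()

	select {
	case r := <-ch:
		return r.height, r.err
	case <-time.After(timeout):
		return 0, fmt.Errorf("GetBlockCount timed out after %s", timeout)
	}
}
```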

gartnera avatar Jan 02 '25 18:01 gartnera

Seems we only kept 7 days of logs in Datadog, and the above log prints are no longer available. After searching for those log prints in the code, the initial error was on the Bitcoin RPC GetBlockCount (error getting block height), and then FetchUTXOs failed. When FetchUTXOs fails, zetaclient won't be able to get the balance and outbound will not work properly.

@gartnera As you said, the Bitcoin RPCs are designed to NOT take a context, so they should be able to resume/reconnect by themselves. The root cause is not clear from the logs above.

ws4charlie avatar Jan 06 '25 16:01 ws4charlie

The whole process was:

  • We have a watchRPCStatus goroutine that NEVER stops for the Bitcoin chain.
  • Every minute, this goroutine prints either RPC latency is OK or an error message.
  • The last heartbeat of validator1-us-east-1-mainnet was at 16:20:43, as shown in the screenshot below.
  • The heartbeat resumed after restarting zetaclient at 18:58:33.

Here are my guesses after investigating the logs.

  1. The watchRPCStatus goroutine got stuck on the call blockTime, err := rpc.CheckRPCStatus(ob.btcClient, tssAddress) for 2.5 hours.
  2. The ObserveInbound goroutine also got stuck, for a similar reason, on the line err := ob.ObserveInbound(ctx).

The reasoning is that neither goroutine ever works silently; they print a message on every iteration. To be specific, watchRPCStatus prints RPC latency is OK or an error; ObserveInbound prints observeInboundBTC: block XXX has YYY txs. We see no logs like these after the last heartbeat.

On which lines did the above two goroutines get stuck? A goroutine stops making progress only if it is blocked on a lock or on I/O. There is no lock used in the above goroutines (and no crash either), so I/O is the only thing we can suspect.

In our case, the RPC status check got stuck on one of the following RPC calls inside CheckRPCStatus: GetBlockCount, GetBlockHash, GetBlockHeader, or ListUnspentMinMaxAddresses.

The inbound goroutine got stuck on one of these two lines (not sure which one):

  • currentBlock, err := ob.btcClient.GetBlockCount() at line 76 that invokes GetBlockCount.
  • res, err := ob.GetBlockByNumberCached(int64(blockNumber)) at line 108, which invokes GetBlockHash, GetBlockHeader, GetBlockVerboseTx internally.

The fact that restarting the program resolved the RPC errors is an interesting hint. The 409 error indicates a request conflict with the current state of the target resource; why would that be resolved by a restart?

The link to the Grafana logs: [image]

ws4charlie avatar Jan 06 '25 21:01 ws4charlie

What's a good way to avoid this in the future? We need to ensure zetaclient can handle unusual responses from RPCs without going offline.

Something like: if the RPC heartbeat fails for 5 minutes, restart the goroutine for that network? That would cover cases where we can't handle the error directly when it happens.

CharlieMc0 avatar Jan 06 '25 21:01 CharlieMc0

It feels like restarting zetaclient reset/released a process-level resource (not sure what it was), which resolved the hung RPC. The issue looked like this (for instance): a zetaclient goroutine calls GetBlockCount and the call just gets stuck on that line and never returns, so program execution hangs unexpectedly (the Bitcoin RPCs were designed to return without a context). A remedy for this would be external and operational rather than self-diagnosis inside zetaclient.

We will leave this issue under monitoring for a while before taking action, so we can collect and identify patterns/clues about the errors.

ws4charlie avatar Jan 06 '25 22:01 ws4charlie

What's a good way to avoid this in the future? We need to ensure zetaclient can handle unusual responses from RPCs without going offline.

  1. Completely reimplement the Bitcoin RPC such that it takes contexts and properly returns on timeout/context cancellation.
  2. Add an in process watchdog that must be bumped every N seconds. If it is not bumped the process would automatically restart.

Medium to high effort
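
A rough sketch of option 2, assuming the process supervisor (systemd, Kubernetes, etc.) restarts zetaclientd after exit; the package, type, and stall policy below are illustrative, not existing code:

```go
// Hypothetical sketch: an in-process watchdog. Worker goroutines call Bump()
// on every successful iteration; if no bump arrives within maxStall, the
// process exits so the supervisor restarts it.
package watchdog

import (
	"log"
	"time"
)

type Watchdog struct {
	bumps chan struct{}
}

func New(maxStall time.Duration) *Watchdog {
	w := &Watchdog{bumps: make(chan struct{}, 1)}
	go func() {
		timer := time.NewTimer(maxStall)
		defer timer.Stop()
		for {
			select {
			case <-w.bumps:
				if !timer.Stop() {
					<-timer.C // drain a timer that fired while we were being bumped
				}
				timer.Reset(maxStall)
			case <-timer.C:
				log.Fatalf("watchdog: no bump within %s, exiting for restart", maxStall)
			}
		}
	}()
	return w
}

// Bump records liveness; it is non-blocking so workers never stall on it.
func (w *Watchdog) Bump() {
	select {
	case w.bumps <- struct{}{}:
	default:
	}
}
```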

gartnera avatar Jan 06 '25 23:01 gartnera

Completely reimplement the Bitcoin RPC such that it takes contexts and properly returns on timeout/context cancellation.

I agree, this is the only reliable way of solving this.

We need to write a wrapper around github.com/btcsuite/btcd/btcjson that uses the same DTOs but has proper HTTP & ctx support. Only the methods we actually use need to be implemented; there's no need to code all supported RPC methods.

We can also leverage this moment to bring proper retries, logs, and Prometheus metrics.


UPD: I inspected btcjson; it shouldn't be hard to implement our own client.

btcjson.MarshalCmd(...) marshals types into {"jsonrpc":"1.0","method":"...","params":["..."],"id":1} rpc calls
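
For illustration, a context-aware call needs nothing beyond the standard library; a real wrapper would reuse the btcjson command/response types instead of the hand-rolled structs below, and the endpoint and credentials here are placeholders:

```go
// Hypothetical sketch of a context-aware Bitcoin JSON-RPC call over net/http.
// http.NewRequestWithContext makes the call return promptly when the context
// times out or is cancelled, which is exactly what the current client lacks.
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type rpcRequest struct {
	Jsonrpc string        `json:"jsonrpc"`
	Method  string        `json:"method"`
	Params  []interface{} `json:"params"`
	ID      int           `json:"id"`
}

type rpcResponse struct {
	Result json.RawMessage `json:"result"`
	Error  *struct {
		Code    int    `json:"code"`
		Message string `json:"message"`
	} `json:"error"`
}

func getBlockCount(ctx context.Context, url, user, pass string) (int64, error) {
	body, err := json.Marshal(rpcRequest{Jsonrpc: "1.0", Method: "getblockcount", Params: []interface{}{}, ID: 1})
	if err != nil {
		return 0, err
	}

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	req.SetBasicAuth(user, pass)
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var out rpcResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return 0, err
	}
	if out.Error != nil {
		return 0, fmt.Errorf("rpc error %d: %s", out.Error.Code, out.Error.Message)
	}

	var height int64
	return height, json.Unmarshal(out.Result, &height)
}

func main() {
	// Placeholder endpoint and credentials for a local bitcoind.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	height, err := getBlockCount(ctx, "http://127.0.0.1:8332", "rpcuser", "rpcpass")
	fmt.Println(height, err)
}
```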

swift1337 avatar Jan 07 '25 11:01 swift1337

@gartnera I agree with the short/medium-term plan to set up a watchdog that automatically restarts. Let's monitor how frequently the issue happens going forward and look for patterns. If necessary, re-implementing the Bitcoin RPC (or forking it and adding context support) is not a bad idea if RPC calls hang a lot.

ws4charlie avatar Jan 07 '25 19:01 ws4charlie

The changes recommended here have already been implemented: https://github.com/zeta-chain/zeta-node/blob/ab691b14cbe37a34f5dee66a3f690ad10e14d2e3/zetaclient/chains/bitcoin/client/client.go#L116-L137

We do not automatically retry all calls related to the RPC if there is an HTTP error, though. Monitoring that the RPC is healthy is part of a separate healthcheck routine, which can be monitored via metrics: https://github.com/zeta-chain/zeta-node/blob/ab691b14cbe37a34f5dee66a3f690ad10e14d2e3/zetaclient/chains/bitcoin/observer/observer.go#L291-L304

In my opinion, the problem stated here has been mitigated, though the approach is different from what was originally proposed. As long as we are monitoring the healthcheck metrics of the RPCs, we should be okay handling errors.
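
For illustration, exporting the time of the last successful status check as a gauge lets an external alert catch both explicit RPC errors and silent hangs like the one in this issue. The metric name, package, and check function below are hypothetical, not the actual zetaclient metrics:

```go
// Hypothetical sketch: a healthcheck loop that records the last successful
// Bitcoin RPC status check. If the check errors or hangs, the gauge goes
// stale and an alert on its age fires.
package btcobserver

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var lastRPCCheckOK = promauto.NewGauge(prometheus.GaugeOpts{
	Name: "btc_rpc_last_successful_check_timestamp_seconds", // hypothetical metric name
	Help: "Unix time of the last successful Bitcoin RPC status check.",
})

func watchRPCStatus(check func() error) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		if err := check(); err != nil {
			continue // gauge is not updated, so its age keeps growing
		}
		lastRPCCheckOK.SetToCurrentTime()
	}
}
```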

kingpinXD avatar Oct 28 '25 17:10 kingpinXD