lodestar
backup eth1 node provider failover does not actually work?
Describe the bug When the primary eth1 node goes down and a second eth1 node begins serving requests, the logs are littered with messages like the following:
error: Error updating eth1 chain cache code=ETH1_ERROR_NON_CONSECUTIVE_LOGS, newIndex=123809, prevIndex=90472
Error: ETH1_ERROR_NON_CONSECUTIVE_LOGS
at Eth1DepositsCache.add (/usr/app/node_modules/@chainsafe/lodestar/src/eth1/eth1DepositsCache.ts:48:15)
at Eth1DepositDataTracker.updateDepositCache (/usr/app/node_modules/@chainsafe/lodestar/src/eth1/eth1DepositDataTracker.ts:174:5)
at Eth1DepositDataTracker.update (/usr/app/node_modules/@chainsafe/lodestar/src/eth1/eth1DepositDataTracker.ts:155:33)
at Eth1DepositDataTracker.runAutoUpdate (/usr/app/node_modules/@chainsafe/lodestar/src/eth1/eth1DepositDataTracker.ts:129:29)
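For context, here is a minimal sketch of the kind of consecutiveness check that can raise this error. The names `DepositEvent` and `assertConsecutive` are illustrative, not the actual Lodestar internals:

```ts
// Hedged sketch, not the real Eth1DepositsCache.add: each new deposit log is
// expected to have an index exactly one greater than the last cached index.
interface DepositEvent {
  index: number;
  // ...other deposit fields omitted
}

function assertConsecutive(prevIndex: number, newEvents: DepositEvent[]): void {
  let expected = prevIndex + 1;
  for (const event of newEvents) {
    if (event.index !== expected) {
      // In the reported logs the gap is large (prevIndex=90472, newIndex=123809),
      // so the whole batch is rejected rather than silently skipping the range.
      throw new Error(
        `ETH1_ERROR_NON_CONSECUTIVE_LOGS newIndex=${event.index} prevIndex=${expected - 1}`
      );
    }
    expected += 1;
  }
}
```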
Expected behavior No errors (and no DB corruption?)
Steps to Reproduce
- Have a fully synced node (unsure if this is required)
- Specify two eth1 nodes with --eth1.providerUrls (see the configuration sketch after this list)
- Take the first of the two nodes offline so that the beacon node falls back to the second (this can be verified by seeing JSON-RPC requests arrive on the secondary)
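A hedged sketch of the two-provider setup implied by that flag; the option shape and URLs are illustrative placeholders, not the exact Lodestar option types:

```ts
// Illustrative only: a primary plus a backup eth1 endpoint, in the order the
// beacon node should try them. The real Lodestar config object may differ.
interface Eth1OptionsSketch {
  enabled: boolean;
  providerUrls: string[];
}

const eth1Options: Eth1OptionsSketch = {
  enabled: true,
  providerUrls: [
    "http://eth1-primary:8545", // taken offline in the reproduction steps
    "http://eth1-backup:8545", // should start receiving JSON-RPC requests after failover
  ],
};
```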
Screenshots n/a
Desktop (please complete the following information):
- OS: Ubuntu 20.04 LTS
- Version: chainsafe/lodestar:v0.34.1 via Docker
- Branch: n/a
- Commit hash: n/a
@g11tech can you take a look?
@dapplion :+1:
Marking as HIGH priority since this issue can potentially lead to proposal errors if left unresolved before proposing
somehow there is a gap between the new deposit index and the old deposit index, this is strange because we always base the fetch range on the highest deposit event block number before fetching deposit events
if we prioritize working on this in a sprint, we need to prepare 2 public eth1 nodes to reproduce the issue
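To make the reasoning above concrete, here is a hedged sketch of that fetch-range logic; the names (getNextFetchRange, getHighestDepositEventBlockNumber, maxBlocksPerRequest) are illustrative, not the actual Lodestar code:

```ts
// Hedged sketch of the logic described above: the next range of deposit logs
// is requested starting right after the block of the highest deposit event
// already in the cache.
interface Eth1CacheView {
  getHighestDepositEventBlockNumber(): Promise<number | null>;
}

async function getNextFetchRange(
  cache: Eth1CacheView,
  followedHeadBlock: number,
  maxBlocksPerRequest: number
): Promise<{fromBlock: number; toBlock: number}> {
  const highest = await cache.getHighestDepositEventBlockNumber();
  const fromBlock = (highest ?? 0) + 1;
  // If the fallback provider has a different view of the chain or of the log
  // history than the primary, the logs returned for this range may not be
  // index-consecutive with the cache, which could explain the error above.
  const toBlock = Math.min(fromBlock + maxBlocksPerRequest, followedHeadBlock);
  return {fromBlock, toBlock};
}
```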
Would you be able to test against some of the rescue nodes we have set up for production, @tuyennhv? I believe we have two from two different providers available.
I have a branch (tuyen/eth1_use_fallback_url) to switch between 2 different eth1 provider URLs every 5 minutes, and it can still fetch deposits successfully (this is on mainnet)
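For reference, a hedged sketch of what that test does conceptually; this is not the branch's actual code, and the URLs are placeholders:

```ts
// Illustrative only: rotate the active eth1 provider URL between two endpoints
// every 5 minutes to simulate failover, and fetch deposits through whichever
// endpoint is currently selected.
const providerUrls = ["https://eth1-provider-a.example", "https://eth1-provider-b.example"];
let activeIndex = 0;

function getActiveProviderUrl(): string {
  return providerUrls[activeIndex];
}

setInterval(() => {
  activeIndex = (activeIndex + 1) % providerUrls.length;
  console.log(`switched eth1 provider to ${getActiveProviderUrl()}`);
}, 5 * 60 * 1000);
```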

Also, the log does not show the error from this issue:
grep -e "ETH1_ERROR_NON_CONSECUTIVE_LOGS" -rn beacon-2023-01-02.log
grep -e "Error updating eth1 chain" -rn beacon-2023-01-02.log
Since this issue was opened a while ago and the code has changed, I suppose we don't have it anymore.
@timothysu if you can reproduce, feel free to reopen. Thanks.