
backup eth1 node provider failover does not actually work?

Open timothysu opened this issue 2 years ago • 3 comments

Describe the bug When the primary eth1 node goes down and a second eth1 node begins serving requests, the logs get littered with messages like the following:

error: Error updating eth1 chain cache code=ETH1_ERROR_NON_CONSECUTIVE_LOGS, newIndex=123809, prevIndex=90472

Error: ETH1_ERROR_NON_CONSECUTIVE_LOGS
    at Eth1DepositsCache.add (/usr/app/node_modules/@chainsafe/lodestar/src/eth1/eth1DepositsCache.ts:48:15)
    at Eth1DepositDataTracker.updateDepositCache (/usr/app/node_modules/@chainsafe/lodestar/src/eth1/eth1DepositDataTracker.ts:174:5)
    at Eth1DepositDataTracker.update (/usr/app/node_modules/@chainsafe/lodestar/src/eth1/eth1DepositDataTracker.ts:155:33)
    at Eth1DepositDataTracker.runAutoUpdate (/usr/app/node_modules/@chainsafe/lodestar/src/eth1/eth1DepositDataTracker.ts:129:29)
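
For reference, the error corresponds to a consecutive-index invariant on cached deposit logs. Below is a minimal sketch of that invariant; the type and method names are assumptions inferred from the stack trace, not the actual lodestar code.

```ts
// Illustrative sketch of the invariant behind ETH1_ERROR_NON_CONSECUTIVE_LOGS.
// The names here are assumptions inferred from the stack trace, not the actual
// lodestar implementation.

interface DepositLog {
  index: number; // deposit index emitted by the deposit contract
  blockNumber: number;
}

class DepositsCacheSketch {
  private lastIndex: number | null = null;

  /** Append new deposit logs; each index must follow the previous one by exactly 1. */
  add(logs: DepositLog[]): void {
    for (const log of logs) {
      if (this.lastIndex !== null && log.index !== this.lastIndex + 1) {
        // The reported failure corresponds to this condition:
        // prevIndex=90472, newIndex=123809 leaves a gap of ~33k deposits.
        throw new Error(
          `ETH1_ERROR_NON_CONSECUTIVE_LOGS newIndex=${log.index} prevIndex=${this.lastIndex}`
        );
      }
      this.lastIndex = log.index;
    }
  }
}
```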

Expected behavior No errors (and no DB corruption?)

Steps to Reproduce

  1. Have a fully synced node (unsure if this is required)
  2. Specify two eth1 nodes with --eth1.providerUrls
  3. Take the first of the two nodes offline so that the beacon node falls back to the second (this can be verified by seeing JSON-RPC requests arriving on the secondary)

Screenshots n/a

Desktop (please complete the following information):

  • OS: Ubuntu 20.04 LTS
  • Version: chainsafe/lodestar:v0.34.1 via docker
  • Branch: n/a
  • Commit hash: n/a

timothysu avatar Mar 10 '22 19:03 timothysu

@g11tech can you take a look?

dapplion avatar Mar 14 '22 03:03 dapplion

@dapplion :+1:

g11tech avatar Mar 14 '22 11:03 g11tech

Marking as HIGH priority since this issue can potentially lead to proposal errors if it is left unresolved before proposing

dapplion avatar May 10 '22 12:05 dapplion

Somehow there is a gap between the new deposit index and the old deposit index. This is strange because we always base the fetch range on the highest deposit event block number before fetching deposit events.
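
For reference, a minimal sketch of that fetch flow under assumed names (not the actual lodestar implementation): if the fetch range always starts right after the highest cached deposit block, a gap in indices can only appear when the provider returns an incomplete log set for the requested range, e.g. a fallback node that is not fully synced.

```ts
// Sketch of the deposit-fetch flow described above; names are assumptions, not
// the actual lodestar code. fromBlock resumes from the highest cached deposit
// event block, so indices should stay consecutive unless the provider drops
// logs inside the requested range.

interface DepositLog {
  index: number;
  blockNumber: number;
}

interface Eth1Provider {
  getDepositLogs(fromBlock: number, toBlock: number): Promise<DepositLog[]>;
}

async function fetchNextDepositLogs(
  provider: Eth1Provider,
  highestCachedDepositBlock: number,
  followDistanceBlock: number
): Promise<DepositLog[]> {
  // Always resume right after the block of the last cached deposit event
  const fromBlock = highestCachedDepositBlock + 1;
  const toBlock = Math.min(fromBlock + 1000, followDistanceBlock); // batched range

  // If a fallback provider answers with missing logs for this range (e.g. it is
  // not fully synced), the next cache add sees a jump in deposit indices and
  // the ETH1_ERROR_NON_CONSECUTIVE_LOGS error above is thrown.
  return provider.getDepositLogs(fromBlock, toBlock);
}
```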

If we prioritize working on this in a sprint, we need to prepare 2 public eth1 nodes to reproduce the issue.

twoeths avatar Dec 27 '22 07:12 twoeths

> Somehow there is a gap between the new deposit index and the old deposit index. This is strange because we always base the fetch range on the highest deposit event block number before fetching deposit events.
>
> If we prioritize working on this in a sprint, we need to prepare 2 public eth1 nodes to reproduce the issue.

Would you be able to test against some of the rescue nodes we have set up for production, @tuyennhv? I believe we have two from two different providers available.

philknows avatar Dec 28 '22 21:12 philknows

I have a branch (tuyen/eth1_use_fallback_url) that switches between 2 different eth1 provider urls every 5 minutes, and it can still fetch deposits successfully (this is on mainnet).
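
Roughly, the rotation in that test looks like the sketch below; the structure and URLs are illustrative assumptions, not the code in the tuyen/eth1_use_fallback_url branch.

```ts
// Illustrative sketch of alternating between two eth1 provider URLs on a timer
// for the test described above. The structure and URLs are assumptions, not the
// code in the tuyen/eth1_use_fallback_url branch.

const providerUrls = ["https://eth1-primary.example", "https://eth1-fallback.example"];
let activeIndex = 0;

function getActiveProviderUrl(): string {
  return providerUrls[activeIndex];
}

// Switch the active provider every 5 minutes so both endpoints serve requests
setInterval(() => {
  activeIndex = (activeIndex + 1) % providerUrls.length;
  console.log(`eth1: switched provider to ${getActiveProviderUrl()}`);
}, 5 * 60 * 1000);
```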

[Screenshot: Screen Shot 2023-01-02 at 19 07 54]

Also, the logs do not show the error reported in this issue:

grep -e "ETH1_ERROR_NON_CONSECUTIVE_LOGS" -rn beacon-2023-01-02.log
grep -e "Error updating eth1 chain" -rn beacon-2023-01-02.log

Since this issue has been open for a while and the code has changed, I suppose we don't have it anymore.

@timothysu if you can reproduce, feel free to reopen. Thanks.

twoeths avatar Jan 02 '23 12:01 twoeths