
backup eth1 node provider failover does not actually work?

Open timothysu opened this issue 2 years ago • 3 comments

Describe the bug When the primary eth1 node goes down and a second eth1 node begins serving requests, the logs get littered with messages like the following:

error: Error updating eth1 chain cache code=ETH1_ERROR_NON_CONSECUTIVE_LOGS, newIndex=123809, prevIndex=90472

Error: ETH1_ERROR_NON_CONSECUTIVE_LOGS
    at Eth1DepositsCache.add (/usr/app/node_modules/@chainsafe/lodestar/src/eth1/eth1DepositsCache.ts:48:15)
    at Eth1DepositDataTracker.updateDepositCache (/usr/app/node_modules/@chainsafe/lodestar/src/eth1/eth1DepositDataTracker.ts:174:5)
    at Eth1DepositDataTracker.update (/usr/app/node_modules/@chainsafe/lodestar/src/eth1/eth1DepositDataTracker.ts:155:33)
    at Eth1DepositDataTracker.runAutoUpdate (/usr/app/node_modules/@chainsafe/lodestar/src/eth1/eth1DepositDataTracker.ts:129:29)
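
For reference, the error corresponds to a consecutive-index invariant on cached deposit logs. Below is a minimal sketch of that invariant; the type and method names are assumptions inferred from the stack trace, not the actual lodestar code.

```ts
// Illustrative sketch of the invariant behind ETH1_ERROR_NON_CONSECUTIVE_LOGS.
// The names here are assumptions inferred from the stack trace, not the actual
// lodestar implementation.

interface DepositLog {
  index: number; // deposit index emitted by the deposit contract
  blockNumber: number;
}

class DepositsCacheSketch {
  private lastIndex: number | null = null;

  /** Append new deposit logs; each index must follow the previous one by exactly 1. */
  add(logs: DepositLog[]): void {
    for (const log of logs) {
      if (this.lastIndex !== null && log.index !== this.lastIndex + 1) {
        // The reported failure corresponds to this condition:
        // prevIndex=90472, newIndex=123809 leaves a gap of ~33k deposits.
        throw new Error(
          `ETH1_ERROR_NON_CONSECUTIVE_LOGS newIndex=${log.index} prevIndex=${this.lastIndex}`
        );
      }
      this.lastIndex = log.index;
    }
  }
}
```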

Expected behavior No errors (and no DB corruption?)

Steps to Reproduce

  1. Have a fully synced node (unsure if this is required)
  2. Specify two eth1 nodes with --eth1.providerUrls
  3. Take the first of the two nodes offline so that the beacon node falls back to the second (this can be verified by seeing JSON-RPC requests arriving on the secondary)

Screenshots n/a

Desktop (please complete the following information):

  • OS: Ubuntu 20.04 LTS
  • Version: chainsafe/lodestar:v0.34.1 via docker
  • Branch: n/a
  • Commit hash: n/a

timothysu avatar Mar 10 '22 19:03 timothysu

@g11tech can you take a look?

dapplion avatar Mar 14 '22 03:03 dapplion

@dapplion :+1:

g11tech avatar Mar 14 '22 11:03 g11tech

Marking as HIGH priority since this issue can potentially lead to proposal errors if it is left unresolved before proposing

dapplion avatar May 10 '22 12:05 dapplion

Somehow there is a gap between the new deposit index and the old deposit index. This is strange because we always base the fetch range on the highest deposit event block number before fetching deposit events.
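
For reference, a minimal sketch of that fetch flow under assumed names (not the actual lodestar implementation): if the fetch range always starts right after the highest cached deposit block, a gap in indices can only appear when the provider returns an incomplete log set for the requested range, e.g. a fallback node that is not fully synced.

```ts
// Sketch of the deposit-fetch flow described above; names are assumptions, not
// the actual lodestar code. fromBlock resumes from the highest cached deposit
// event block, so indices should stay consecutive unless the provider drops
// logs inside the requested range.

interface DepositLog {
  index: number;
  blockNumber: number;
}

interface Eth1Provider {
  getDepositLogs(fromBlock: number, toBlock: number): Promise<DepositLog[]>;
}

async function fetchNextDepositLogs(
  provider: Eth1Provider,
  highestCachedDepositBlock: number,
  followDistanceBlock: number
): Promise<DepositLog[]> {
  // Always resume right after the block of the last cached deposit event
  const fromBlock = highestCachedDepositBlock + 1;
  const toBlock = Math.min(fromBlock + 1000, followDistanceBlock); // batched range

  // If a fallback provider answers with missing logs for this range (e.g. it is
  // not fully synced), the next cache add sees a jump in deposit indices and
  // the ETH1_ERROR_NON_CONSECUTIVE_LOGS error above is thrown.
  return provider.getDepositLogs(fromBlock, toBlock);
}
```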

If we prioritize working on this in a sprint, we need to prepare 2 public eth1 nodes to reproduce the issue.

twoeths avatar Dec 27 '22 07:12 twoeths

> Somehow there is a gap between the new deposit index and the old deposit index. This is strange because we always base the fetch range on the highest deposit event block number before fetching deposit events.
>
> If we prioritize working on this in a sprint, we need to prepare 2 public eth1 nodes to reproduce the issue.

Would you be able to test against some of the rescue nodes we have set up for production, @tuyennhv? I believe we have two from two different providers available.

philknows avatar Dec 28 '22 21:12 philknows

I have a branch (tuyen/eth1_use_fallback_url) that switches between 2 different eth1 provider urls every 5 minutes, and it can still fetch deposits successfully (this is on mainnet).
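
Roughly, the rotation in that test looks like the sketch below; the structure and URLs are illustrative assumptions, not the code in the tuyen/eth1_use_fallback_url branch.

```ts
// Illustrative sketch of alternating between two eth1 provider URLs on a timer
// for the test described above. The structure and URLs are assumptions, not the
// code in the tuyen/eth1_use_fallback_url branch.

const providerUrls = ["https://eth1-primary.example", "https://eth1-fallback.example"];
let activeIndex = 0;

function getActiveProviderUrl(): string {
  return providerUrls[activeIndex];
}

// Switch the active provider every 5 minutes so both endpoints serve requests
setInterval(() => {
  activeIndex = (activeIndex + 1) % providerUrls.length;
  console.log(`eth1: switched provider to ${getActiveProviderUrl()}`);
}, 5 * 60 * 1000);
```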

[Screenshot: Screen Shot 2023-01-02 at 19 07 54]

Also, the logs do not show the error reported in this issue:

grep -e "ETH1_ERROR_NON_CONSECUTIVE_LOGS" -rn beacon-2023-01-02.log
grep -e "Error updating eth1 chain" -rn beacon-2023-01-02.log

Since this issue has been open for a while and the code has changed, I suppose we don't have it anymore.

@timothysu if you can reproduce, feel free to reopen. Thanks.

twoeths avatar Jan 02 '23 12:01 twoeths