lighthouse icon indicating copy to clipboard operation
lighthouse copied to clipboard

Pause sync when execution layer is offline

Open pawanjay176 opened this issue 3 years ago • 3 comments

Issue Addressed

Resolves #3032

Proposed Changes

This PR fixes 2 issues with post-merge sync:

  1. Execution engine offline errors were treated similar to other BeaconChainErrors in range sync which caused peers to get downscored and kicked for potentially no fault of theirs. This PR pauses RangeSync when the EE is offline and only resumes after the EE is back online. It doesn't do any leftover block processing until the EE is back online and hence doesn't disconnect from any connected peers. It polls the EE to check if it's online every 5 seconds and resumes the chains once it is back online.

  2. If the EE goes offline when the BN is fully synced, processing parent chain lookups returns EE errors which causes the parent chain to get added to the failed_chains list .Any subsequent request for the same parent chain gets ignored. Here, even a single slot EE outage causes a full epoch of beacon chain outage because failed chains aren't processed until range sync kicks in after an epoch. To fix this, we don't add the parent lookup to the list of failed_chains on EE offline errors.

Additional info

Currently rebased on #3036 until it gets merged to unstable.

pawanjay176 avatar Mar 16 '22 13:03 pawanjay176

So I skimmed now the parent lookup part and have a couple of questions Wasn't really able to tell the difference between waiting_blocks and waiting_execution since both seem to have a copy of blocks currently processing Also, until now we would only send a parent chain for processing when a block was already known and here we seem to do it much more often, for example when we get a single block lookup that matches with the chain from the first block. We need to match the chain with the db on the last block so I'm not sure why we do this

divagant-martian avatar Apr 06 '22 16:04 divagant-martian

This guy is finally ready for review now! :tada:

I'm using similar approaches for both block lookups and range sync. i.e. pause sync when block verification fails because of execution layer being offline. Sync is resumed in both cases when the execution layer comes back online and sends a notification to the sync manager using a channel.

This felt like the easiest way to incorporate execution layer going offline scenarios without changing the existing sync code too much.

pawanjay176 avatar May 20 '22 19:05 pawanjay176

I'm dropping the v2.5.0 tag after a chat with @AgeManning. I think it's totally fine that this isn't in v2.5.0 and I wouldn't like to rush this or delay the release :relaxed:

paulhauner avatar Jul 27 '22 04:07 paulhauner

Closing this in favor of #3428

pawanjay176 avatar Aug 17 '22 20:08 pawanjay176