Pause sync when execution layer is offline
Issue Addressed
Resolves #3032
Proposed Changes
This PR fixes 2 issues with post-merge sync:
-
Execution engine offline errors were treated similar to other
BeaconChainErrors in range sync which caused peers to get downscored and kicked for potentially no fault of theirs. This PR pausesRangeSyncwhen the EE is offline and only resumes after the EE is back online. It doesn't do any leftover block processing until the EE is back online and hence doesn't disconnect from any connected peers. It polls the EE to check if it's online every 5 seconds and resumes the chains once it is back online. -
If the EE goes offline when the BN is fully synced, processing parent chain lookups returns EE errors which causes the parent chain to get added to the
failed_chainslist .Any subsequent request for the same parent chain gets ignored. Here, even a single slot EE outage causes a full epoch of beacon chain outage because failed chains aren't processed until range sync kicks in after an epoch. To fix this, we don't add the parent lookup to the list offailed_chainson EE offline errors.
Additional info
Currently rebased on #3036 until it gets merged to unstable.
So I skimmed now the parent lookup part and have a couple of questions
Wasn't really able to tell the difference between waiting_blocks and waiting_execution since both seem to have a copy of blocks currently processing
Also, until now we would only send a parent chain for processing when a block was already known and here we seem to do it much more often, for example when we get a single block lookup that matches with the chain from the first block. We need to match the chain with the db on the last block so I'm not sure why we do this
This guy is finally ready for review now! :tada:
I'm using similar approaches for both block lookups and range sync. i.e. pause sync when block verification fails because of execution layer being offline. Sync is resumed in both cases when the execution layer comes back online and sends a notification to the sync manager using a channel.
This felt like the easiest way to incorporate execution layer going offline scenarios without changing the existing sync code too much.
I'm dropping the v2.5.0 tag after a chat with @AgeManning. I think it's totally fine that this isn't in v2.5.0 and I wouldn't like to rush this or delay the release :relaxed:
Closing this in favor of #3428