besu
Always switch full sync target when local chain is close to chain head
Signed-off-by: Fabio Di Fabio [email protected]
PR description
Full sync has a sync target stability feature that works well when syncing from genesis, but it is not optimal once the initial sync is done and the local chain head is equal or very close to the target head, because in that situation the stability feature can prevent switching to the best peer.
By default, the stability feature prevents switching to the best peer unless its height is 200 blocks greater, or its difficulty is 1_000_000_000_000_000_000L greater, than that of the current sync target.
These values are too large when we are already in sync and just need to follow the best peer, so this PR only enables the stability feature when the current chain head is far from the target head, which occurs when doing a full sync from genesis or when restarting Besu after some hours or days.
This can help Besu stay in sync. If the block propagation manager is missing some blocks, it stops caching incoming blocks once their distance from the local chain head is > 30. Because the chain state of a peer is only updated when a new block is seen from that peer, there may then not be enough information for the current sync target switching strategy to switch to the best peer.
On an already synced node, there are 2 hidden configuration flags that always force the switch to the best peer, without this patch:
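As a rough illustration of the decision described above, here is a minimal Java sketch. It is not Besu's actual code: the class name, method name, and the `NEAR_HEAD_DISTANCE` cutoff are hypothetical (the description does not say how "far from the target head" is measured), and it assumes that exceeding either the height or the difficulty threshold is enough to allow a switch. Only the two default threshold values come from the description.

```java
import java.math.BigInteger;

/** Illustrative sketch only; class and method names are not Besu's. */
final class SyncTargetSwitchSketch {
  // Default thresholds quoted in the PR description.
  static final long HEIGHT_THRESHOLD = 200L;
  static final BigInteger TD_THRESHOLD = BigInteger.valueOf(1_000_000_000_000_000_000L);

  // Hypothetical cutoff for "close to the target head"; the PR text gives no value.
  static final long NEAR_HEAD_DISTANCE = 10L;

  static boolean shouldSwitch(
      long localHeight,
      long targetHeight,
      BigInteger targetTd,
      long bestHeight,
      BigInteger bestTd) {
    // Never switch to a peer that is not better than the current target.
    boolean bestIsBetter = bestHeight > targetHeight || bestTd.compareTo(targetTd) > 0;
    if (!bestIsBetter) {
      return false;
    }
    // Close to the target head: skip the stability check and follow the best peer.
    if (targetHeight - localHeight <= NEAR_HEAD_DISTANCE) {
      return true;
    }
    // Far from the target head: apply the stability thresholds
    // (assumption: exceeding either one allows the switch).
    boolean heightExceeds = bestHeight - targetHeight > HEIGHT_THRESHOLD;
    boolean tdExceeds = bestTd.subtract(targetTd).compareTo(TD_THRESHOLD) > 0;
    return heightExceeds || tdExceeds;
  }

  public static void main(String[] args) {
    BigInteger td = BigInteger.valueOf(1_000L);
    // Nearly in sync: a peer only 3 blocks ahead of the target.
    System.out.println(shouldSwitch(1_000L, 1_002L, td, 1_005L, td.add(BigInteger.ONE)));
    // Syncing from genesis: a peer only 50 blocks ahead of the target.
    System.out.println(shouldSwitch(0L, 1_000_000L, td, 1_000_050L, td.add(BigInteger.ONE)));
  }
}
```

With the default thresholds, the first call in `main` would be blocked by the stability feature if it applied near the head, which is the situation this PR is trying to avoid.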
--Xsynchronizer-downloader-change-target-threshold-by-td=0
--Xsynchronizer-downloader-change-target-threshold-by-height=0
Fixed Issue(s)
potential fix for #3955
Documentation
- [x] I thought about documentation and added the `doc-change-required` label to this PR if updates are required.
Changelog
- [x] I thought about the changelog and included a changelog update if required.
Full sync shouldn't be kicking in when we're close to the chain head though - it's quite a heavyweight thing to have kick in when the `BlockPropagationManager` should be responsible for keeping us in sync when close to head. Similarly, `BlockPropagationManager` should stop attempting to backfill when we're too far from the head of the chain, as it's then less efficient than full sync.
It is important to ensure the thresholds are set correctly so you get a seamless transfer from full sync to gossip and back, but always having full sync active isn't the right answer. It will likely be requesting block data that we either have or will backfill from gossip, increasing bandwidth usage and placing additional load on our peers which may cause them to downscore us.
@ajsutton Setting aside this PR, which after my latest analysis is not as helpful as I thought and which I will probably discard, let me clarify that full sync already runs in parallel with block synchronization via gossip. Nothing tells full sync to pause while block synchronization is running, and full sync usually does nothing if block synchronization works fine, since the latter is usually the first to see and import new blocks and to update the peer status.
In detail, full sync checks every 5 seconds whether the local chain is in sync with the best peer. Since the best peer status is usually updated* by the block propagation manager after receiving the head block via gossip, full sync has nothing to do (it does not start a new pipeline) and just goes to sleep for another 5 seconds. *The other way full sync can see a new better peer is upon a new connection.
Grafana shows that when block download is stalled, full sync runs are zero, whereas usually there are some starts (see the graph below). So I am now focusing on why it does not restart, and I am putting this on hold for now.
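The periodic check described above can be sketched roughly as follows. This is an assumption-laden illustration, not Besu's implementation: the class, enum, and method names are hypothetical, and "in sync with the best peer" is simplified to a plain height comparison, whereas the real check may also take other chain state into account.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Illustrative sketch only; names are hypothetical and not taken from Besu. */
final class FullSyncPollSketch {
  enum Action { START_PIPELINE, SLEEP }

  // Simplification: "in sync" is modelled as the best peer not being ahead by height.
  static Action check(long localHeight, long bestPeerHeight) {
    return bestPeerHeight > localHeight ? Action.START_PIPELINE : Action.SLEEP;
  }

  // Runs the check every 5 seconds; the caller is responsible for shutting the scheduler down.
  static ScheduledExecutorService startPolling(
      LongSupplier localHeight, LongSupplier bestPeerHeight, Runnable startPipeline) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleWithFixedDelay(
        () -> {
          Action action = check(localHeight.getAsLong(), bestPeerHeight.getAsLong());
          if (action == Action.START_PIPELINE) {
            startPipeline.run();
          }
          // Otherwise there is nothing to do until the next tick.
        },
        0L,
        5L,
        TimeUnit.SECONDS);
    return scheduler;
  }
}
```

In this model, when gossip imports each new head block before the next tick, `check` keeps returning `SLEEP`, which matches the observation that full sync normally stays idle while block propagation works.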
> @ajsutton Setting aside this PR, which after my latest analysis is not as helpful as I thought and which I will probably discard, let me clarify that full sync already runs in parallel with block synchronization via gossip. Nothing tells full sync to pause while block synchronization is running, and full sync usually does nothing if block synchronization works fine, since the latter is usually the first to see and import new blocks and to update the peer status.
So there's a difference here between full sync monitoring to see if it should do something and actively syncing. It's not that full sync should be completely stopped - it should keep monitoring, but it should be configured such that it doesn't kick in when the node is nearly in sync, because the block propagation manager should handle fetching blocks that are close to the chain head. With full sync kicking in as often as those graphs show, Besu is likely wasting bandwidth, and that's something we should revisit once we have the block propagation manager working again.
> In detail, full sync checks every 5 seconds whether the local chain is in sync with the best peer. Since the best peer status is usually updated* by the block propagation manager after receiving the head block via gossip, full sync has nothing to do (it does not start a new pipeline) and just goes to sleep for another 5 seconds. *The other way full sync can see a new better peer is upon a new connection.
The idea here is that the best peer should need to be some reasonable amount ahead of our current chain (more than a few blocks) for full sync to actually activate. Otherwise it should leave it to the block propagation manager.
Not something we should change until we have fixed the current issues where it doesn't stay in sync, but I suspect we're wasting a bunch of bandwidth downloading the same block multiple times with the way it works now. May not be an issue post merge though when block gossip is stopped anyway.