
Optimize block syncing by preventing unconditional task spawning

Open · greymistcube opened this issue · 0 comments

[14:27:19 DBG] Preparing a Blocks message to reply to 00395f9250...
[14:27:19 DBG] Fetching block 1/500 61076a067105a850d07c0bbed175d018f76ed5dc29b83173556bbfe376257015 to include in a reply to 00395f9250...
[14:27:19 DBG] Fetching block 156/500 4cc695fab945d91928c16abb0349dade9b53b75264219fa50fd8c52d5382864c to include in a reply to 00125f9250...
[14:27:19 DBG] Fetching block 137/500 39efb21dee81b4ce5b2dfa54a9b3f21f71f9e895916fde620a3396fc5d05ab01 to include in a reply to 001a5f9250...
[14:27:19 DBG] Fetching block 128/500 523f055923bbb40f24be2ec50b0dc29f85ea250a156d4d222ac52a74a58a1b8c to include in a reply to 00255f9250...

Above is a sample log from running two nodes together with a relatively large blockchain height difference and short block intervals.

Let's say node A currently has blockchain height 100 and node B has blockchain height 0. As node A mines a block of index n, the following happens:

  1. Node A broadcasts BlockHeaderMessage for a Block<T> of index n.
  2. Node B receives BlockHeaderMessage, determines it needs to sync up, sends GetBlockHashes.
  3. Node A receives GetBlockHashes, replies with BlockHashes to node B.
  4. Node B requests Block<T>s to node A with GetBlocks.
  5. Node A replies with over 100 Block<T>s.

If all the steps can be performed in a relatively short cycle, in particular, much shorter than the standard block interval, there wouldn't be much of a problem. However, if this isn't the case, multiple tasks get spawned on both sides.
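
The handshake above can be sketched as a toy simulation. Everything here (the `Node` class, a chain modeled as a list of block indices) is invented for illustration and only mirrors the message names from the steps; it is not Libplanet's actual API:

```python
# Hypothetical sketch of the five-step sync handshake described above.
class Node:
    def __init__(self, name, height):
        self.name = name
        self.chain = list(range(height + 1))  # block indices 0..height

    # Step 1: A broadcasts a header for its tip.
    def broadcast_header(self):
        return ("BlockHeaderMessage", self.chain[-1])

    # Steps 2-3: B sends GetBlockHashes for what it is missing;
    # A replies with BlockHashes.
    def get_block_hashes(self, peer):
        start = self.chain[-1] + 1
        return peer.chain[start:]

    # Steps 4-5: B sends GetBlocks; A replies with the blocks.
    def get_blocks(self, peer, hashes):
        return [peer.chain[i] for i in hashes]

a = Node("A", height=100)
b = Node("B", height=0)

_, tip = a.broadcast_header()       # step 1
missing = b.get_block_hashes(a)     # steps 2-3
blocks = b.get_blocks(a, missing)   # steps 4-5
print(len(blocks))                  # 100: blocks of index 1..100
```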

For example, while node B is downloading 101 Block&lt;T&gt;s from node A, it may receive a notification of yet another new Block&lt;T&gt; and ask node A for 102 Block&lt;T&gt;s again. The downloading tasks (getting 101 and then 102 Block&lt;T&gt;s on node B) and the transferring tasks (fetching and replying with those Block&lt;T&gt;s on node A) are handled completely independently and start to clog up computing resources on both ends (more so on node A, which is the bigger problem). I suspect node B can even ask for the same set of Block&lt;T&gt;s repeatedly, i.e. ask for 101 Block&lt;T&gt;s again in the scenario above, and node A will unconditionally fetch those Block&lt;T&gt;s over and over.
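
The cost of the unconditional behavior can be shown with a toy counter (all names here are invented for illustration): nothing in the handler remembers that a transfer covering the same request is already in the pipeline, so a duplicate request doubles the work.

```python
# Toy model of node A's handler: every GetBlocks message triggers a
# full fetch, even an exact duplicate from the same peer.
fetch_count = 0

def fetch_and_reply(hashes):
    global fetch_count
    fetch_count += len(hashes)  # stands in for one store lookup per block

def handle_get_blocks(peer, hashes):
    # Unconditional: no bookkeeping of in-flight transfers per peer.
    fetch_and_reply(hashes)

request = list(range(101))
handle_get_blocks("B", request)
handle_get_blocks("B", request)  # B re-asks before the first reply lands
print(fetch_count)               # 202: every block fetched twice
```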

To resolve this properly, fixes should be made on two fronts:

  • Node B should not try to send GetBlockHashes, and thus GetBlocks, to the same peer if a syncing process is already in the pipeline.
  • Node A should ignore GetBlocks from the same peer if a transferring process is already in the pipeline.
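
A minimal sketch of the per-peer guard both fixes need, written in Python rather than Libplanet's C# and with all names invented: record which peers have a sync or transfer in flight, and skip spawning a new task for a peer that is already being served.

```python
import threading

_in_flight = set()            # peers with a transfer already in the pipeline
_lock = threading.Lock()

def try_begin_transfer(peer):
    """Mark `peer` as busy; return False if a transfer is already running."""
    with _lock:
        if peer in _in_flight:
            return False
        _in_flight.add(peer)
        return True

def end_transfer(peer):
    # Called when the reply finishes (or fails) so the next sync can start.
    with _lock:
        _in_flight.discard(peer)

print(try_begin_transfer("B"))   # True: first GetBlocks is served
print(try_begin_transfer("B"))   # False: duplicate from same peer, ignored
end_transfer("B")
print(try_begin_transfer("B"))   # True: a fresh sync can proceed
```

The same pattern works on node B's side by keying the set on the peer being synced from, so a second GetBlockHashes is never sent while a download from that peer is still running.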

I suspect the former will be easier to fix and the latter harder. However, since an external peer can overwhelm a running node this way, the latter fix is the more critical one.

greymistcube · Jun 30 '22 02:06