libplanet
Optimize block syncing by preventing unconditional task spawning
```
[14:27:19 DBG] Preparing a Blocks message to reply to 00395f9250...
[14:27:19 DBG] Fetching block 1/500 61076a067105a850d07c0bbed175d018f76ed5dc29b83173556bbfe376257015 to include in a reply to 00395f9250...
[14:27:19 DBG] Fetching block 156/500 4cc695fab945d91928c16abb0349dade9b53b75264219fa50fd8c52d5382864c to include in a reply to 00125f9250...
[14:27:19 DBG] Fetching block 137/500 39efb21dee81b4ce5b2dfa54a9b3f21f71f9e895916fde620a3396fc5d05ab01 to include in a reply to 001a5f9250...
[14:27:19 DBG] Fetching block 128/500 523f055923bbb40f24be2ec50b0dc29f85ea250a156d4d222ac52a74a58a1b8c to include in a reply to 00255f9250...
```
Above is a sample log from running two nodes with a relatively large blockchain height difference and short block intervals.
Let's say there is a node A currently with blockchain height 100 and a node B with blockchain height 0. As node A mines a block of index `n`, the following happens:
- Node A broadcasts a `BlockHeaderMessage` for a `Block<T>` of index `n`.
- Node B receives the `BlockHeaderMessage`, determines it needs to sync up, and sends `GetBlockHashes`.
- Node A receives `GetBlockHashes` and replies with `BlockHashes` to node B.
- Node B requests the `Block<T>`s from node A with `GetBlocks`.
- Node A replies with over 100 `Block<T>`s.
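The handshake above can be sketched as a single round trip (the class and method names below are hypothetical stand-ins for illustration, not libplanet's actual API):

```python
# Sketch of the sync handshake: header -> GetBlockHashes -> GetBlocks.
# Names are hypothetical; only the message flow mirrors the steps above.

class NodeA:
    def __init__(self, height):
        self.blocks = list(range(height + 1))  # block indices 0..height

    def handle_get_block_hashes(self, peer_height):
        # Reply with hashes of the blocks the peer is missing.
        return [f"hash-{i}" for i in self.blocks[peer_height + 1:]]

    def handle_get_blocks(self, hashes):
        # Fetch each requested block, as in the "Fetching block i/N" log.
        return [f"block-{h}" for h in hashes]

class NodeB:
    def __init__(self):
        self.height = 0

    def on_block_header(self, node_a, advertised_index):
        if advertised_index <= self.height:
            return []                                        # already synced
        hashes = node_a.handle_get_block_hashes(self.height)  # GetBlockHashes
        blocks = node_a.handle_get_blocks(hashes)             # GetBlocks
        self.height += len(blocks)
        return blocks

a, b = NodeA(height=100), NodeB()
downloaded = b.on_block_header(a, advertised_index=100)
print(len(downloaded))  # 100 blocks transferred in one sync round
```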
If all the steps complete in a relatively short cycle, in particular one much shorter than the standard block interval, there wouldn't be much of a problem. However, if this isn't the case, multiple tasks get spawned on both sides.
That is, for example, while node B is still downloading 101 `Block<T>`s from node A, if node B receives a notification of a new `Block<T>`, it asks node A for 102 `Block<T>`s again. The downloading tasks on node B (getting 101 and then 102 `Block<T>`s) and the transferring tasks on node A (fetching and replying with those `Block<T>`s) are handled completely independently and start to clog up computing resources on both ends (more on node A, actually, which is the bigger problem). I suspect node B can even ask for the same set of `Block<T>`s repeatedly, i.e. ask for 101 `Block<T>`s again in the scenario above, and node A will unconditionally fetch those `Block<T>`s over and over.
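The redundant work can be demonstrated with a small simulation (a sketch with hypothetical names, assuming each incoming header unconditionally spawns a download task, as described above):

```python
# Without an in-flight check, every BlockHeaderMessage spawns a new download
# task, so headers arriving mid-sync trigger overlapping transfers.
import threading
import time

fetch_count = 0
lock = threading.Lock()

def download_blocks(peer, count):
    global fetch_count
    for _ in range(count):
        with lock:
            fetch_count += 1   # node A fetches one block per requested hash
        time.sleep(0.001)      # simulate a slow transfer

def on_block_header(peer, advertised_height, local_height):
    # No guard for an already-running sync: a task is spawned every time.
    t = threading.Thread(target=download_blocks,
                         args=(peer, advertised_height - local_height))
    t.start()
    return t

# Node B hears about blocks 101 and 102 before the first sync finishes:
tasks = [on_block_header("A", 101, 0), on_block_header("A", 102, 0)]
for t in tasks:
    t.join()
print(fetch_count)  # 203 fetches for at most 102 useful blocks
```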
To resolve this properly, fixes should be made on two fronts:
- Node B should not try to send `GetBlockHashes`, and thus `GetBlocks`, to the same peer if a syncing process is already in the pipeline.
- Node A should ignore `GetBlocks` from the same peer if a transferring process is already in the pipeline.
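Both guards boil down to the same mechanism: track which peers have an operation in flight and skip duplicates. A minimal sketch (the `SyncGuard` helper is hypothetical, not libplanet code):

```python
# Per-peer in-flight guard: the first request from a peer proceeds, and
# duplicates are ignored until that peer's operation completes.
import threading

class SyncGuard:
    """Tracks peers that already have an operation in the pipeline."""

    def __init__(self):
        self._in_flight = set()
        self._lock = threading.Lock()

    def try_begin(self, peer):
        with self._lock:
            if peer in self._in_flight:
                return False       # duplicate request: caller should ignore it
            self._in_flight.add(peer)
            return True

    def end(self, peer):
        with self._lock:
            self._in_flight.discard(peer)

guard = SyncGuard()
print(guard.try_begin("A"))  # True  - first sync with peer A proceeds
print(guard.try_begin("A"))  # False - duplicate while the sync is in flight
guard.end("A")
print(guard.try_begin("A"))  # True  - a new sync is allowed after completion
```

Node B would consult such a guard before sending `GetBlockHashes`, and node A before servicing `GetBlocks`; `end` must be called on completion or failure (e.g. in a `finally` block) so a crashed sync doesn't block the peer forever.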
I suspect the former will be easier to fix and the latter harder. However, since an external peer can overwhelm a running node this way, the latter fix is the more critical one.