node: Handle a race condition between the `request for a block` and `block acceptance` procedures.
Describe the bug
It has been discovered that there could be a timing conflict between the `flood_request` handling and the block acceptance process. Essentially, while a node is handling a `GetResource` (a.k.a. `GetData`) request for a block A, it might simultaneously be in the process of accepting the same block A, resulting in the failure of the `GetResource` request.
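A minimal sketch of the timing window (not the actual rusk code), assuming the lookup only consults the store of already-accepted blocks: the Ledger CF is modelled as an in-memory map, acceptance and request handling run on separate threads, and all names (handle_get_resource, accept_block) are stand-ins for the real handlers.

use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

type Hash = [u8; 32];
type Ledger = Mutex<HashMap<Hash, Vec<u8>>>;

// Stand-in for the request handler: only the accepted-blocks store is consulted,
// so a block that is still being accepted is reported as "not found".
fn handle_get_resource(ledger: &Ledger, hash: &Hash) -> Option<Vec<u8>> {
    ledger.lock().unwrap().get(hash).cloned()
}

// Stand-in for acceptance: the block appears in the ledger only after the
// (slow) accept/finalize work has finished.
fn accept_block(ledger: &Ledger, hash: Hash, block: Vec<u8>) {
    thread::sleep(Duration::from_millis(50)); // models the EST accept/finalize call
    ledger.lock().unwrap().insert(hash, block);
}

fn main() {
    let ledger = Arc::new(Mutex::new(HashMap::new()));
    let hash = [0u8; 32];

    // Acceptance of block A is still in flight...
    let l = Arc::clone(&ledger);
    let acceptor = thread::spawn(move || accept_block(&l, hash, b"block A".to_vec()));

    // ...when the GetResource request for the same block A arrives and fails.
    println!("during acceptance: {:?}", handle_get_resource(&ledger, &hash)); // None

    acceptor.join().unwrap();
    println!("a moment later:    {:?}", handle_get_resource(&ledger, &hash)); // Some(..)
}

During the sleep that stands in for the accept/finalize work the lookup returns None, so the request fails; immediately afterwards the very same lookup would succeed.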
To Reproduce
The scenario has been reproduced multiple times on localnet, where network latency is zero. However, it is also expected to be a valid scenario on the testnet, especially if the EST (accept/finalize) call takes longer.
Expected behaviour
If a node knows the requested block (e.g. it may be stored in the Candidates CF), it should provide it. The sooner it is provided, the more synchronized the network becomes.
Logs/Screenshot
/// NODE_4 broadcasts the request:
{"timestamp":"2024-05-09T09:06:52.456742Z","level":"INFO","event":"flood_request block","req":"BlockRequest::ByHash(196b889f51ceca54...d78c685493de136c)","target":"node::chain::fsm"}
/// NODE_3 could not find the block
{"timestamp":"2024-05-09T09:06:52.458299Z","level":"INFO","message":"handle_get_resource called max_entries =100, fetch_block GetResource { inventory: Inv { inv_list: [InvVect { inv_type: BlockFromHash, param: Hash([25, 107, 136, 159, 81, 206, 202, 84, 95, 144, 86, 146, 87, 90, 131, 254, 19, 219, 6, 10, 141, 169, 241, 118, 215, 140, 104, 84, 147, 222, 19, 108]) }] }, requester_addr: 127.0.0.1:7004, ttl_as_sec: 1715245622 }","target":"node::databroker"}
{"timestamp":"2024-05-09T09:06:52.458331Z","level":"INFO","message":"handle_get_resource couldn't found","target":"node::databroker"}
/// NODE_2 could not find the block
{"timestamp":"2024-05-09T09:06:52.458724Z","level":"INFO","message":"handle_get_resource called max_entries =100, fetch_block GetResource { inventory: Inv { inv_list: [InvVect { inv_type: BlockFromHash, param: Hash([25, 107, 136, 159, 81, 206, 202, 84, 95, 144, 86, 146, 87, 90, 131, 254, 19, 219, 6, 10, 141, 169, 241, 118, 215, 140, 104, 84, 147, 222, 19, 108]) }] }, requester_addr: 127.0.0.1:7004, ttl_as_sec: 1715245622 }","target":"node::databroker"}
{"timestamp":"2024-05-09T09:06:52.458747Z","level":"INFO","message":"handle_get_resource couldn't found","target":"node::databroker"}
/// NODE_1 could not find the block
{"timestamp":"2024-05-09T09:06:52.459559Z","level":"INFO","message":"handle_get_resource called max_entries =100, fetch_block GetResource { inventory: Inv { inv_list: [InvVect { inv_type: BlockFromHash, param: Hash([25, 107, 136, 159, 81, 206, 202, 84, 95, 144, 86, 146, 87, 90, 131, 254, 19, 219, 6, 10, 141, 169, 241, 118, 215, 140, 104, 84, 147, 222, 19, 108]) }] }, requester_addr: 127.0.0.1:7004, ttl_as_sec: 1715245622 }","target":"node::databroker"}
{"timestamp":"2024-05-09T09:06:52.459600Z","level":"INFO","message":"handle_get_resource couldn't found","target":"node::databroker"}
/// NODES do accept the block a few milliseconds later
/tmp/rust-harness.lTtg/node_0.log:{"timestamp":"2024-05-09T09:06:52.513942Z","level":"INFO","event":"block accepted","height":590,"iter":0,"hash":"196b889f51ceca54...d78c685493de136c","txs":0,"state_hash":"9854b9aeee6a3745...2373f8e05976690c","fsv_bitset":3,"ssv_bitset":3,"block_time":1,"generator":"rdHa4H5p1rVtQyqD","dur_ms":"12","label":"Final","ffr":false,"target":"node::chain::acceptor"}
/tmp/rust-harness.lTtg/node_1.log:{"timestamp":"2024-05-09T09:06:52.530574Z","level":"INFO","event":"block accepted","height":590,"iter":0,"hash":"196b889f51ceca54...d78c685493de136c","txs":0,"state_hash":"9854b9aeee6a3745...2373f8e05976690c","fsv_bitset":3,"ssv_bitset":3,"block_time":1,"generator":"rdHa4H5p1rVtQyqD","dur_ms":"9","label":"Final","ffr":false,"target":"node::chain::acceptor"}
/tmp/rust-harness.lTtg/node_2.log:{"timestamp":"2024-05-09T09:06:52.530574Z","level":"INFO","event":"block accepted","height":590,"iter":0,"hash":"196b889f51ceca54...d78c685493de136c","txs":0,"state_hash":"9854b9aeee6a3745...2373f8e05976690c","fsv_bitset":3,"ssv_bitset":3,"block_time":1,"generator":"rdHa4H5p1rVtQyqD","dur_ms":"9","label":"Final","ffr":false,"target":"node::chain::acceptor"}
/tmp/rust-harness.lTtg/node_3.log:{"timestamp":"2024-05-09T09:06:52.509718Z","level":"INFO","event":"block accepted","height":590,"iter":0,"hash":"196b889f51ceca54...d78c685493de136c","txs":0,"state_hash":"9854b9aeee6a3745...2373f8e05976690c","fsv_bitset":3,"ssv_bitset":3,"block_time":1,"generator":"rdHa4H5p1rVtQyqD","dur_ms":"7","label":"Final","ffr":false,"target":"node::chain::acceptor"}
Additional context
An identical problematic situation has been occurring on ITN/testnet, where there is a race between the GetData request and block acceptance as well.
Option 1: Assumes that the block returned for a `GetResource` request carries its certificate.
function handleGetResource(hash):
begin
    block = searchInLedgerCF(hash)
    if block is not found:
        // the block may still sit in the Candidates CF while it is being accepted
        block = searchInCandidatesCF(hash)
        if block is found:
            attachCertificate(block)
    if block is found:
        sendBlockToRequester(block, requester_addr)
end
*CF - RocksDB Column Family
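For illustration only, one possible shape of Option 1 in Rust, with plain maps standing in for the Ledger and Candidates CFs; the Certificate type, the quorum_certs lookup and every name below are assumptions rather than the node's actual API.

use std::collections::HashMap;

type Hash = [u8; 32];

#[derive(Clone)]
struct Certificate; // placeholder for the real certificate/attestation payload

#[derive(Clone)]
struct Block {
    hash: Hash,
    cert: Option<Certificate>,
}

struct Store {
    ledger_cf: HashMap<Hash, Block>,     // accepted blocks, certificate included
    candidates_cf: HashMap<Hash, Block>, // candidates, no certificate attached yet
    quorum_certs: HashMap<Hash, Certificate>,
}

impl Store {
    // Option 1: fall back to the Candidates CF and attach the certificate
    // before replying, so the requester receives a block it can accept.
    fn handle_get_resource(&self, hash: &Hash) -> Option<Block> {
        if let Some(block) = self.ledger_cf.get(hash) {
            return Some(block.clone());
        }
        let mut candidate = self.candidates_cf.get(hash)?.clone();
        candidate.cert = self.quorum_certs.get(hash).cloned();
        Some(candidate)
    }
}

The caller would then forward the returned block to requester_addr, exactly as in the pseudocode above.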
Option 2: Assumes that the requester (the sender of the GetResource request) maintains a cache of received Quorum messages.
// in Receiver:
function handleGetResource(hash):
begin
    block = searchInLedgerCF(hash)
    if block is not found:
        block = searchInCandidatesCF(hash)
        if block is found:
            // sends a candidate without certificate
            sendCandidateToRequester(block, requester_addr)
end
// in Requester:
function handleCandidate(candidate_block):
begin
    certificate = fetchCertificateFromCache(hash(candidate_block))
    block = attachCertificate(candidate_block, certificate)
    on_block_event(block)
end
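Under the same caveats (illustrative names, an in-memory map standing in for the cache), the requester side of Option 2 might look as follows: the candidate arrives without a certificate and is completed from locally cached Quorum messages keyed by block hash.

use std::collections::HashMap;

type Hash = [u8; 32];

#[derive(Clone)]
struct Certificate; // placeholder for the real certificate payload

struct Block {
    hash: Hash,
    cert: Option<Certificate>,
}

struct Requester {
    // Quorum messages seen so far, keyed by the block hash they attest to.
    quorum_cache: HashMap<Hash, Certificate>,
}

impl Requester {
    fn handle_candidate(&mut self, mut candidate: Block) {
        match self.quorum_cache.remove(&candidate.hash) {
            Some(cert) => {
                // Complete the candidate locally and hand it to acceptance.
                candidate.cert = Some(cert);
                self.on_block_event(candidate);
            }
            // Without a matching Quorum the bare candidate cannot be accepted yet.
            None => eprintln!("no cached Quorum for the candidate, dropping it"),
        }
    }

    fn on_block_event(&mut self, _block: Block) {
        // Entry point of the normal block-acceptance pipeline.
    }
}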
WRT Option 2:
Store/persist the Quorum message in a RocksDB column family. An entry is deleted once it is fetched or when it expires. On receiving a block without a certificate, the requester looks up the corresponding certificate in the Quorum CF. This way many Quorums can be stored temporarily on disk.
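As a rough sketch of that idea, using the rocksdb crate directly instead of the node's own database layer; the column-family name, the key (block hash) and the value layout (an 8-byte expiry timestamp followed by the raw Quorum payload) are assumptions made for illustration.

use std::time::{Duration, SystemTime, UNIX_EPOCH};

use rocksdb::{ColumnFamilyDescriptor, DB, Options};

const QUORUM_CF: &str = "quorum";

/// Persistent cache of Quorum messages, keyed by block hash.
/// Entries are consumed as soon as they are fetched, or ignored once expired.
struct QuorumStore {
    db: DB,
}

impl QuorumStore {
    fn open(path: &str) -> Result<Self, rocksdb::Error> {
        let mut opts = Options::default();
        opts.create_if_missing(true);
        opts.create_missing_column_families(true);
        let cf = ColumnFamilyDescriptor::new(QUORUM_CF, Options::default());
        let db = DB::open_cf_descriptors(&opts, path, vec![cf])?;
        Ok(Self { db })
    }

    /// Persist a Quorum payload for `block_hash`, valid for `ttl`.
    fn put(&self, block_hash: &[u8; 32], quorum: &[u8], ttl: Duration) -> Result<(), rocksdb::Error> {
        let expiry = (SystemTime::now() + ttl)
            .duration_since(UNIX_EPOCH)
            .unwrap()
            .as_secs();
        // Value layout: 8-byte big-endian expiry timestamp, then the raw payload.
        let mut value = expiry.to_be_bytes().to_vec();
        value.extend_from_slice(quorum);
        let cf = self.db.cf_handle(QUORUM_CF).expect("cf exists");
        self.db.put_cf(cf, block_hash, value)
    }

    /// Fetch-and-delete the Quorum for `block_hash`, if present and not expired.
    fn take(&self, block_hash: &[u8; 32]) -> Result<Option<Vec<u8>>, rocksdb::Error> {
        let cf = self.db.cf_handle(QUORUM_CF).expect("cf exists");
        let Some(value) = self.db.get_cf(cf, block_hash)? else {
            return Ok(None);
        };
        // The entry is consumed either way: fetched or expired.
        self.db.delete_cf(cf, block_hash)?;
        let expiry = u64::from_be_bytes(value[..8].try_into().unwrap());
        let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs();
        Ok((expiry > now).then(|| value[8..].to_vec()))
    }
}

Deleting the entry on fetch keeps the column family small; entries that expire without ever being requested could additionally be removed by a periodic sweep.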