Rounds API not completing
Description
After removing collections, the Rounds API fails to return a response to the caller. The call remains open until the process is killed, which blocks any forward progress on creating a proposed block.
Steps to reproduce
(Prerequisite: generate a gRPC client from the proto files, make a release build of the node, rename the binary to `narwhal_node`, and place it in your PATH.)
- Start up the narwhal network stack with the configurations contained in results.zip
- Wait until startup is complete for all workers and primaries
- Set up worker clients and primary clients for all nodes
- Via round robin, cycle through all worker clients and submit 500 TXs in total, with bodies tx-0 through tx-499
- Sleep 3 seconds
- Choose a primary client and identify it as the proposer
- Obtain the oldest and newest rounds via the Rounds API
- Obtain collections for a proposal via the node_read_causal API
- Obtain TXs from the proposal's collections via the get_collections API
- Print TXs, save to disk, etc (w/e floats your debugging boat 😄 )
- Issue remove collections at the proposer and all other primary clients
- Pick a new proposer from the primary clients (I chose one round robin) and restart at step 6
- The restart here fails to complete: the Rounds call never returns and the connection stays open
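The submission step above can be sketched in Go. This is a minimal, self-contained sketch: `txSubmitter` and `mockWorker` are hypothetical stand-ins for the generated gRPC worker clients used by the real test in `narwhal_network_test.go`.

```go
package main

import "fmt"

// txSubmitter is a hypothetical stand-in for a generated gRPC worker client.
type txSubmitter interface {
	SubmitTransaction(body string) error
}

// mockWorker records the transaction bodies it receives, in place of a real worker.
type mockWorker struct {
	txs []string
}

func (m *mockWorker) SubmitTransaction(body string) error {
	m.txs = append(m.txs, body)
	return nil
}

// submitRoundRobin cycles through the worker clients and submits
// bodies tx-0 .. tx-(n-1), one transaction per client in turn.
func submitRoundRobin(workers []txSubmitter, n int) error {
	for i := 0; i < n; i++ {
		w := workers[i%len(workers)]
		if err := w.SubmitTransaction(fmt.Sprintf("tx-%d", i)); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	ws := []txSubmitter{&mockWorker{}, &mockWorker{}, &mockWorker{}, &mockWorker{}}
	if err := submitRoundRobin(ws, 500); err != nil {
		panic(err)
	}
	for i, w := range ws {
		fmt.Printf("worker %d received %d txs\n", i, len(w.(*mockWorker).txs))
	}
}
```

With 4 workers and 500 transactions, each worker ends up with 125 bodies, evenly interleaved.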
Additional note: I have been unable to retrieve all 500 TXs I submitted, regardless of how long I wait or how large/small the batches/collections are.
The code to run these tests in an automated fashion can be found here, specifically in the `narwhal_network_test.go` file.
Logs, db state, etc. are all included in results.zip
results.zip
@jsteenb2 following our conversation, summarising the observations we have so far:
- Try setting `batch_size` to a value higher than 5 bytes (e.g. try with 50 or 100 to begin with and see the results)
- Try giving some extra time before you start posting your transactions and after you finish posting them (e.g. instead of 3 seconds, try something like 6-7 and retry)
- The reason your second `rounds` call never completes is an infinite-loop issue within the DAG compression, which we'll be looking to resolve here
@jsteenb2 it appears that the infinite loop is due to a refactoring in the DAG where a recursive call was changed to an iteration (thanks @huitseeker for discovering the root cause!). The change has been reverted here.
I've produced the binary from that commit and re-ran the tests. The issue seems to be fixed, but another nasty bug surfaced - you can see the details here. We'll need to look into it further to confirm and prioritise accordingly.
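To illustrate the class of bug described above (not the actual Rust implementation): an iterative causal walk over a DAG needs an explicit visited set to terminate, because certificates share parents across rounds. Dropping that check when converting recursion to iteration is exactly the kind of regression that can make a walk like the one behind `node_read_causal` spin instead of returning. A minimal sketch, assuming a simplified `node` type:

```go
package main

import "fmt"

// node is a simplified stand-in for a certificate in the Narwhal DAG:
// each round's certificate points back at parents from the previous round.
type node struct {
	id      string
	parents []*node
}

// causal walks the DAG iteratively from a starting certificate and returns
// every reachable certificate exactly once. The visited set is what
// guarantees termination: without it, certificates reachable via multiple
// paths are re-pushed on every path, and the walk never drains the stack
// in a timely fashion (or at all, under compression/rewrites).
func causal(start *node) []string {
	visited := map[string]bool{}
	stack := []*node{start}
	var out []string
	for len(stack) > 0 {
		n := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if visited[n.id] {
			continue // already processed via another path
		}
		visited[n.id] = true
		out = append(out, n.id)
		stack = append(stack, n.parents...)
	}
	return out
}

func main() {
	// Diamond shape: two round-1 certificates share a round-0 parent.
	g := &node{id: "r0"}
	p1 := &node{id: "r1-a", parents: []*node{g}}
	p2 := &node{id: "r1-b", parents: []*node{g}}
	top := &node{id: "r2", parents: []*node{p1, p2}}
	fmt.Println(len(causal(top))) // 4 distinct certificates
}
```

A recursive version gets the same de-duplication "for free" if it checks the visited set before recursing, which is easy to lose when mechanically rewriting the recursion as a loop.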