
Rounds API not completing

Open jsteenb2 opened this issue 3 years ago • 2 comments

Description

After removing collections, the Rounds API fails to return a response to the caller. The connection remains open until the process is killed, which blocks any forward progress on creating a proposed block.

Steps to reproduce

(Prerequisites: create a gRPC client from the proto, make a release build of the node, move it into your PATH, and rename it to narwhal_node.)

  1. Start up narwhal network stack with configurations contained in results.zip
  2. Wait until start up is complete for all workers and primaries
  3. Set up worker clients and primary clients for all nodes
  4. Via round robin, cycle through all worker clients and submit 500 TXs with body (tx-0) through (tx-499) (in total)
  5. Sleep 3 seconds
  6. Choose a primary client and identify it as a proposer
  7. Obtain oldest and newest rounds via Rounds API
  8. Obtain collections for a proposal via node_read_causal API
  9. Obtain TXs from collection of proposal via get_collections API
  10. Print TXs, save to disk, etc. (whatever floats your debugging boat 😄)
  11. Issue remove collections at proposer and all other primary clients
  12. Pick a new proposer from the primary clients (I chose one round robin) and restart at step 6
    • the restart here fails: the connection never completes
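
For readers who want to reproduce step 4 without the full test harness, here is a minimal sketch of the round-robin submission. The `TxSubmitter` interface, its `SubmitTransaction` method, and the in-memory `recordingWorker` are hypothetical stand-ins for the generated worker gRPC client, not the real Narwhal API. The worker count of 4 in `main` is also an assumption; only the transaction bodies (tx-0 through tx-499) come from the steps above.

```go
package main

import "fmt"

// TxSubmitter is a hypothetical stand-in for a generated worker gRPC client.
type TxSubmitter interface {
	SubmitTransaction(body []byte) error
}

// recordingWorker is an in-memory fake that records every body it receives.
type recordingWorker struct {
	received [][]byte
}

func (w *recordingWorker) SubmitTransaction(body []byte) error {
	w.received = append(w.received, body)
	return nil
}

// submitRoundRobin sends `total` transactions with bodies tx-0 .. tx-(total-1),
// cycling through the worker clients in order.
func submitRoundRobin(workers []TxSubmitter, total int) error {
	if len(workers) == 0 {
		return fmt.Errorf("no worker clients available")
	}
	for i := 0; i < total; i++ {
		body := []byte(fmt.Sprintf("tx-%d", i))
		if err := workers[i%len(workers)].SubmitTransaction(body); err != nil {
			return fmt.Errorf("submit tx-%d: %w", i, err)
		}
	}
	return nil
}

func main() {
	// Assumption: four workers; the issue does not state the actual count.
	fakes := []*recordingWorker{{}, {}, {}, {}}
	workers := make([]TxSubmitter, len(fakes))
	for i, f := range fakes {
		workers[i] = f
	}
	if err := submitRoundRobin(workers, 500); err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, f := range fakes {
		fmt.Printf("worker %d received %d transactions\n", i, len(f.received))
	}
}
```

Recording the submitted bodies on the client side like this makes it easier to compare them against what get_collections later returns.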

Additional note: I have been unable to obtain all 500 TXs I've submitted, regardless of how long I wait or how big/small the batches/collections are.

The code to run these tests in an automated fashion can be found here, specifically in the narwhal_network_test.go file

Logs, db state, etc. are all included in results.zip

jsteenb2 avatar Jul 22 '22 14:07 jsteenb2

@jsteenb2 following our conversation, here is a summary of the observations we have so far:

  1. Try setting the batch_size to a value higher than 5 bytes (e.g. try 50 or 100 to begin with and see the results)
  2. Try allowing some extra time before you start posting your transactions and after you have finished posting them (maybe instead of 3 seconds give something like 6-7 and retry)
  3. Your second rounds call never completes because of an infinite loop issue within the DAG compression, which we'll be looking to resolve here

akichidis avatar Jul 27 '22 20:07 akichidis

@jsteenb2 it appears that the infinite loop is due to a refactoring in the DAG where a recursive call was changed to an iteration (thanks @huitseeker for discovering the root cause!). The change has been reverted here.

I've produced the binary from that commit and re-run the tests. The issue seems to have been fixed, but another nasty bug has surfaced. You can see the details here. We'll need to look into it more closely to confirm and prioritise accordingly.

akichidis avatar Aug 02 '22 21:08 akichidis