snarkOS
[Bug] Leader certificate withholding halts a share of nodes that bond&unbond
🐛 Bug Report
When a malicious node withholds its certificate in every round in which it is the leader, a large share of the nodes that bond and unbond halt.
Explanation
Steps in our implementation
- Initially, 4 nodes are connected and 16 nodes are not yet connected to the network, for 20 nodes in total
- All nodes generate and submit transactions (as opposed to only the first node 10)
- The Byzantine node (node 0) always withholds its certificate in the rounds in which it is the leader
- We implement dynamic committee changes: the committee size alternates between 4 and 20 indefinitely (4 to 20 to 4 to 20 …) as nodes 4-19 bond and unbond
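The setup above can be summarized in a few lines of Python. This is only an illustrative sketch, not code from the stress-test branches: the names `committee_sizes` and `should_withhold` are hypothetical, and the Byzantine node is assumed to be node 0, as in the reproduction steps.

```python
from itertools import cycle, islice

CORE_NODES = 4    # validators that stay in the committee throughout
TOTAL_NODES = 20  # core validators plus the 16 that bond and unbond


def committee_sizes(phases):
    """Return the committee size for each phase: 4, 20, 4, 20, ..."""
    return list(islice(cycle([CORE_NODES, TOTAL_NODES]), phases))


def should_withhold(node_id, round_leader, byzantine_id=0):
    """True only for the Byzantine node, and only in rounds it leads.

    byzantine_id=0 is an assumption taken from the reproduction steps.
    """
    return node_id == byzantine_id and round_leader == byzantine_id


print(committee_sizes(5))     # [4, 20, 4, 20, 4]
print(should_withhold(0, 0))  # True: node 0 leads this round
print(should_withhold(0, 3))  # False: node 0 is not the leader
```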
Intuition
- The intuition for the attack is as follows. When a Byzantine node is the leader in round r and withholds its certificate for the duration of that round, no one can commit its anchor immediately. In a later round, however, the Byzantine node provides the certificate when an honest node requests it. As a result, if this anchor contains bonding and unbonding transactions, the honest nodes see the changed committee only in hindsight.
- Node 0 always remains in the committee so that the same node withholds certificates throughout.
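The "only in hindsight" effect can be illustrated with a toy model. This is a deliberately simplified sketch and not snarkOS logic: it assumes that the committee change carried by an anchor would apply from the round after the anchor, and that an honest node switches to the new committee only once the certificate is delivered. `Anchor`, `committee_views`, and `stale_rounds` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anchor:
    round: int                # round in which the leader proposed the anchor
    new_committee: frozenset  # committee that results from committing the anchor


def committee_views(initial, anchor, delivered_at, last_round):
    """Committee an honest node believes in at each round.

    The node learns of anchor.new_committee only at round `delivered_at`.
    """
    views = {}
    for r in range(anchor.round, last_round + 1):
        views[r] = anchor.new_committee if r >= delivered_at else initial
    return views


def stale_rounds(initial, anchor, delivered_at, last_round):
    """Rounds after the anchor in which the node still used the old committee."""
    views = committee_views(initial, anchor, delivered_at, last_round)
    return [r for r, c in views.items()
            if r > anchor.round and c != anchor.new_committee]


old = frozenset(range(4))   # the 4 core validators
new = frozenset(range(20))  # all 20 validators after bonding
anchor = Anchor(round=10, new_committee=new)

# Certificate withheld until round 14: rounds 11-13 ran on a stale committee.
print(stale_rounds(old, anchor, delivered_at=14, last_round=16))  # [11, 12, 13]
# Honest leader, certificate available in round 10: no stale rounds.
print(stale_rounds(old, anchor, delivered_at=10, last_round=16))  # []
```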
Steps to Reproduce
- Clone this repo: https://github.com/vicsn/aleo-stress-test/tree/main/kp-scripts/test-008
- Have 20 AWS EC2 nodes ready and configured
- Install the branch kp/fix/stress_test_3 on all nodes (you can use the reinstall.sh script)
  - This branch ensures that all nodes always send transactions
- Install the branch kp/fix/stress_test_1 on node 0 (you can use the reinstall2.sh script)
  - This branch additionally implements the certificate withholding functionality
- Use the A_run_network.sh script, which automatically:
  - Starts the network with 4 initial nodes
  - Transfers funds to the addresses of the 16 further validators
  - Starts the 16 further validators
  - Indefinitely bonds and unbonds the 16 validators
- Once all 20 nodes are started, you can use the script print_block_heights2.py to monitor the block heights of the nodes
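When monitoring block heights, a halted node can be told apart from a network-wide stall by comparing it against its peers. The sketch below is an assumption about how such a check might look and is not the contents of print_block_heights2.py. It takes already collected height samples, so it makes no claims about how the heights are fetched. `find_halted` and its `window` parameter are hypothetical.

```python
def find_halted(samples, window=3):
    """Return the sorted IDs of nodes whose height is stuck while others advance.

    samples: dict mapping node ID -> list of observed block heights, oldest first.
    window:  number of most recent samples to compare (hypothetical parameter).
    """
    def recent(heights):
        return heights[-window:]

    def advanced(heights):
        r = recent(heights)
        return len(r) == window and r[-1] > r[0]

    def stalled(heights):
        r = recent(heights)
        return len(r) == window and r[-1] == r[0]

    # If no node advanced, the whole network is idle; flag nobody.
    if not any(advanced(h) for h in samples.values()):
        return []
    return sorted(n for n, h in samples.items() if stalled(h))


samples = {
    0: [10, 12, 14],  # advancing
    4: [10, 10, 10],  # stuck
    7: [9, 11, 11],   # advanced within the window
}
print(find_halted(samples))  # [4]
```

Requiring that at least one peer advanced keeps the check from flagging every node when block production pauses everywhere.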
Findings
- Eventually, a large share of nodes 4-19 halt.
- Some halted nodes with higher block heights may be able to catch up later, but those with lower block heights seem to halt indefinitely. Nodes that halt show error messages (Protocol violation and red errors).
- While nodes are unbonding, there is also a message that nodes are connected to more validators than there are validators in the committee set.
- After a halted validator node is restarted, it successfully syncs again!