indy-plenum
Timing-related bug in view change protocol
It looks like Plenum has a timing-related bug in the view change protocol.
Potential steps to reproduce
- create a test pool with 4 nodes
- pause 2 nodes, neither of which is primary. If using a docker environment:
  - use the `docker pause` command, so that the nodes are frozen and no explicit disconnection events happen
  - pause Node3 and Node4, which are guaranteed not to be primaries initially
- wait for 30 minutes; during that time:
  - the master primary will send a freshness batch (probably a couple of times)
  - the working nodes will receive and store these batches, but won't be able to order them because of the lack of consensus
  - after about 10 minutes the working nodes (including the primary) should realize that consensus is lost and start sending votes for a view change (INSTANCE_CHANGE messages), but because of the lack of consensus the view change won't start
- after 30 minutes, unpause the paused nodes
  - they will realize that consensus was lost for too long and will also vote for a view change
  - the view change will start and a NEW_VIEW message containing the previously unordered freshness batches will be created, but ordering will fail with a complaint about an incorrect batch time
  - so the next view change will happen, with the same result
  - so the pool will enter a perpetual view change cycle even though all nodes are up and healthy
- restarting all nodes at once should break the cycle and put the pool back into a healthy state
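The pause and unpause steps above could be scripted roughly as follows. This is only a sketch: the container names `node3` and `node4` are assumptions about how the docker environment names its containers, and the helper names are made up for illustration. It defaults to a dry run that only prints the `docker pause` / `docker unpause` commands it would execute.

```python
import subprocess
import time

# Assumed container names for Node3 and Node4; adjust to your environment.
PAUSED_NODES = ["node3", "node4"]
PAUSE_SECONDS = 30 * 60  # the 30-minute wait from the steps above


def build_commands(action, containers):
    """Return one `docker <action> <container>` argv list per container."""
    if action not in ("pause", "unpause"):
        raise ValueError(f"unsupported action: {action}")
    return [["docker", action, name] for name in containers]


def run(commands, dry_run=True):
    """Print the commands in dry-run mode, otherwise execute them."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


def reproduce(dry_run=True):
    """Freeze the non-primary nodes, wait, then unfreeze them."""
    run(build_commands("pause", PAUSED_NODES), dry_run)
    if not dry_run:
        time.sleep(PAUSE_SECONDS)
    run(build_commands("unpause", PAUSED_NODES), dry_run)


if __name__ == "__main__":
    reproduce(dry_run=True)
```

Calling `reproduce(dry_run=False)` would actually pause the containers, sleep for 30 minutes, and unpause them; the pool state then has to be inspected separately.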
The actual steps when I caught this were longer, but based on my preliminary analysis the steps above should also suffice.
Cause and potential fix
- there is indeed a safeguard on batch time during normal ordering, so that a malicious primary cannot create batches far in the future or in the past
- however, this safeguard also applies to batches that are reordered during a view change. If for whatever reason the view change takes longer than the safeguard window, those batches cannot be reordered, because their timestamps cannot be altered, and so the view change can never finish
- a potential fix should include either different time safeguard logic for the reordering phase, or disabling the safeguard during reordering (however, a thorough analysis of the safety of such a change should be performed first)
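To illustrate the suspected cause, here is a toy model of the time safeguard. It is not Plenum's code: the names, the 600-second window, and the 60-second view change duration are all assumptions chosen for illustration. The relaxed check shown for reordering (rejecting only timestamps too far in the future) is just one possible shape of a fix, and its safety has not been analyzed.

```python
from dataclasses import dataclass

# Hypothetical safeguard window in seconds; the real value is set in Plenum's config.
ACCEPTABLE_DEVIATION_SECS = 600


@dataclass(frozen=True)
class Batch:
    pp_time: int  # timestamp assigned by the primary; cannot be altered later


def is_time_acceptable(batch, now, window=ACCEPTABLE_DEVIATION_SECS):
    """Normal-ordering safeguard: reject batches too far in the past or future."""
    return abs(now - batch.pp_time) <= window


def can_order(batch, now, reordering, relax_on_reorder):
    """Apply the safeguard, optionally relaxed while reordering in a view change."""
    if reordering and relax_on_reorder:
        # Hypothetical fix: during reordering only reject timestamps from the future.
        return batch.pp_time <= now + ACCEPTABLE_DEVIATION_SECS
    return is_time_acceptable(batch, now)


def view_changes_until_ordered(batch, start, relax_on_reorder,
                               view_change_secs=60, max_attempts=5):
    """Return how many view changes it takes to reorder the batch, or None."""
    now = start
    for attempt in range(1, max_attempts + 1):
        if can_order(batch, now, reordering=True, relax_on_reorder=relax_on_reorder):
            return attempt
        now += view_change_secs  # time only moves forward, so the batch gets staler
    return None


freshness_batch = Batch(pp_time=0)  # created right after the nodes were paused
unpause_time = 30 * 60              # nodes are unpaused 30 minutes later

print(view_changes_until_ordered(freshness_batch, unpause_time, relax_on_reorder=False))  # None
print(view_changes_until_ordered(freshness_batch, unpause_time, relax_on_reorder=True))   # 1
```

With the unmodified check, every additional view change only makes the stored batch older, so in this model the cycle cannot end on its own. That matches the observed behaviour, in which only restarting all nodes at once recovers the pool.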