[narwhal] recover from loss of liveness - last header missed
Steps to Reproduce Issue
On a 4-validator cluster, we let the rounds advance and then shut down 2 of the 4 nodes (f+1 failures). The protocol should stop and no round should advance.
The 2 crashed nodes did receive the headers that the 2 healthy nodes proposed, and acknowledged receipt at the network level, but they crashed before they managed to vote for them.
We bring one of the crashed nodes back up. The restarted node manages to make a proposal, but it doesn't vote for any of the last headers that it previously missed.
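For reference, the quorum arithmetic behind this scenario can be sketched as follows. This is a minimal sketch that assumes equally weighted validators and the usual n >= 3f + 1 bound; the function names are hypothetical and are not taken from the narwhal codebase.

```rust
/// Maximum number of faulty validators tolerated, assuming `n` equally
/// weighted validators and the usual bound n >= 3f + 1.
pub fn max_faults(n: usize) -> usize {
    n.saturating_sub(1) / 3
}

/// Votes (including the author's own) needed to certify a header: 2f + 1.
pub fn quorum(n: usize) -> usize {
    2 * max_faults(n) + 1
}

/// True if `live` validators, all of them voting, can still certify a header.
pub fn can_certify(n: usize, live: usize) -> bool {
    live >= quorum(n)
}
```

With n = 4 this gives f = 1 and a quorum of 3, so the 2 healthy nodes alone cannot certify anything, while 3 live nodes can, provided the restarted node actually votes.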
Expected Result
Since the restarted node had acked receipt of the header before crashing, it should vote for it after the restart. That would allow the corresponding certificates to be created and, eventually, a quorum of certificates to become available for round advancement.
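One way to picture the expected behavior is a store that records a header before it is acked, so that a restarted node can find the headers it never voted for. The sketch below is only an illustration: `PendingHeaderStore`, `HeaderId`, and all method names are hypothetical, and the in-memory maps stand in for whatever durable storage the node really uses.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical header identifier; stands in for a real header digest.
pub type HeaderId = u64;

/// Hypothetical stand-in for a durable store that survives a restart.
#[derive(Default)]
pub struct PendingHeaderStore {
    /// Headers recorded before the network ack was sent (id -> author).
    received: BTreeMap<HeaderId, String>,
    /// Headers this node has already voted for.
    voted: BTreeSet<HeaderId>,
}

impl PendingHeaderStore {
    /// Record the header first; only then should the caller ack receipt.
    pub fn record_received(&mut self, id: HeaderId, author: &str) {
        self.received.insert(id, author.to_string());
    }

    /// Mark a header as voted for.
    pub fn record_vote(&mut self, id: HeaderId) {
        self.voted.insert(id);
    }

    /// On startup: headers that were acked but never voted for.
    pub fn unvoted(&self) -> Vec<(HeaderId, String)> {
        self.received
            .iter()
            .filter(|(id, _)| !self.voted.contains(*id))
            .map(|(id, author)| (*id, author.clone()))
            .collect()
    }
}
```

On restart, the node would iterate over `unvoted()` and send the missing votes before (or alongside) making its own proposal.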
Actual Result
The restarted node doesn't vote for the acked header after restart. As a result, the 2 healthy nodes that proposed their headers are never able to create the corresponding certificates, as not enough votes are gathered. The cluster stalls because not enough certificates (a quorum) are produced for the round to advance.
Have the remaining nodes retried their headers?
Since the "crashed" nodes have acked the headers, the "healthy" nodes have no reason to retry. Also, there is no feature in our code that retries a header when it hasn't been voted on for a while.
Got it, I'm thinking about the need to retry headers when there are not enough votes.
Yeah, we actually talked about two different solutions for it with @laura-makdah. Maybe the three of us can have a quick catch-up?
Sounds good. Let me know anytime you catch up.
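To make the retry idea from this thread concrete, here is a rough sketch of a proposer that resends its header to peers that have not voted once a timeout passes without a quorum. It is not one of the two solutions mentioned above (those are not described in this issue); `PendingProposal`, its fields, and the timeout are all hypothetical, and it assumes the author counts as a voter for its own header.

```rust
use std::collections::BTreeSet;
use std::time::{Duration, Instant};

/// Hypothetical tracker for one proposed header that is awaiting votes.
pub struct PendingProposal {
    voters: BTreeSet<String>,
    quorum: usize,
    last_sent: Instant,
    retry_after: Duration,
}

impl PendingProposal {
    pub fn new(author: &str, quorum: usize, retry_after: Duration, now: Instant) -> Self {
        let mut voters = BTreeSet::new();
        // Assumption: the author counts as a voter for its own header.
        voters.insert(author.to_string());
        Self { voters, quorum, last_sent: now, retry_after }
    }

    pub fn add_vote(&mut self, voter: &str) {
        self.voters.insert(voter.to_string());
    }

    pub fn is_certified(&self) -> bool {
        self.voters.len() >= self.quorum
    }

    /// Peers that should be sent the header again. Returns an empty list if
    /// a quorum was already reached or the retry timeout has not elapsed.
    pub fn due_retries(&mut self, peers: &[&str], now: Instant) -> Vec<String> {
        if self.is_certified() || now.duration_since(self.last_sent) < self.retry_after {
            return Vec::new();
        }
        self.last_sent = now; // restart the timer for the next retry
        peers
            .iter()
            .filter(|p| !self.voters.contains(**p))
            .map(|p| p.to_string())
            .collect()
    }
}
```

Passing `now` in explicitly keeps the timer logic testable without sleeping. A real implementation would also need to decide how often to retry and when to give up.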
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.