sui [narwhal] recover from loss of liveness

Steps to Reproduce Issue

On a 4-validator cluster, we let the rounds advance and then shut down 2 of the 4 nodes (f+1 failures, with f = 1). The protocol should stop and no round should advance.

The 2 crashed nodes did receive the headers proposed by the 2 healthy nodes and acknowledged receipt at the network level, but they crashed before they managed to vote for them.

We bring one of the crashed nodes back up. The restarted node manages to make a proposal, but it doesn't vote for any of the most recently sent headers that it missed.

Expected Result

Since the restarted node had previously acked receipt of the header, it should vote for it after the restart. That would allow the corresponding certificates to be created, eventually making a quorum of certificates available for round advancement.
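
One way to picture the expected behaviour is a node that durably records a header before acking it, then votes for anything still unvoted when it starts up again. The sketch below is only an illustration under that assumption: `PendingStore`, `record_ack`, `mark_voted`, and `unvoted_after_restart` are hypothetical names, not Narwhal APIs, and an in-memory map stands in for persistent storage.

```rust
use std::collections::BTreeMap;

/// Hypothetical identifier for a header: (round, author index).
type HeaderId = (u64, u32);

/// Stand-in for a persistent store of headers this node has acked.
/// The bool records whether this node has already voted for the header.
#[derive(Default)]
struct PendingStore {
    acked: BTreeMap<HeaderId, bool>,
}

impl PendingStore {
    /// Called before acking receipt, so the header survives a crash.
    fn record_ack(&mut self, id: HeaderId) {
        self.acked.entry(id).or_insert(false);
    }

    /// Called once the vote has been sent.
    fn mark_voted(&mut self, id: HeaderId) {
        self.acked.insert(id, true);
    }

    /// On restart: every acked header that still lacks this node's vote.
    fn unvoted_after_restart(&self) -> Vec<HeaderId> {
        self.acked
            .iter()
            .filter(|(_, voted)| !**voted)
            .map(|(id, _)| *id)
            .collect()
    }
}
```

In this model, a node that acked headers `(10, 0)` and `(10, 1)` but only voted for the first would get `(10, 1)` back from `unvoted_after_restart` and could vote for it before making new proposals.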

Actual Result

The restarted node doesn't vote for the acked header after restart. As a result, the 2 healthy nodes that proposed their headers are never able to create the corresponding certificates, because not enough votes are gathered. The cluster stalls because a quorum of certificates is never produced, so the round cannot advance.
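
The arithmetic behind the stall can be sketched as follows. This is a minimal model, assuming equal-stake validators, the usual n = 3f + 1 bound, and a certificate that needs 2f + 1 votes counting the proposer's own. The function names are hypothetical and not taken from the Narwhal code base.

```rust
/// Largest number of faulty validators tolerated by a committee of `n`
/// equal-stake validators, assuming n >= 3f + 1.
fn max_faults(n: usize) -> usize {
    n.saturating_sub(1) / 3
}

/// Votes needed to certify a header (2f + 1), counting the proposer's own.
fn quorum_threshold(n: usize) -> usize {
    2 * max_faults(n) + 1
}

/// Whether a header that gathered `votes` votes can become a certificate.
fn can_certify(n: usize, votes: usize) -> bool {
    votes >= quorum_threshold(n)
}
```

With n = 4 this gives f = 1 and a threshold of 3. While only the 2 healthy nodes are up, each header collects at most 2 votes, so `can_certify(4, 2)` is false. If the restarted node voted for the headers it had acked, each would reach 3 votes and `can_certify(4, 3)` would be true.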

Oct 27 '22 19:10 akichidis

Have the remaining nodes retried their headers?

Oct 28 '22 17:10 mwtian

Since the "crashed" nodes have acked the headers, the "healthy" nodes won't have a reason to retry. Also, there is no feature in our code that retries a header when it hasn't been voted on for a while.

Oct 28 '22 18:10 akichidis

Got it, I'm thinking about the need to retry headers when there are not enough votes.
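
The retry idea could look roughly like the sketch below: a proposer tracks how long its header has been waiting for a quorum and re-broadcasts it once a timeout passes. Everything here is an assumption for illustration. `HeaderRetry`, `should_retry`, `mark_resent`, and the timeout value are hypothetical and do not describe existing Narwhal code.

```rust
use std::time::{Duration, Instant};

/// Hypothetical retry state for the proposer's current header.
struct HeaderRetry {
    votes: usize,
    quorum: usize,
    last_sent: Instant,
    retry_after: Duration,
}

impl HeaderRetry {
    fn new(quorum: usize, retry_after: Duration, now: Instant) -> Self {
        // Assumption: the proposer's own vote is counted from the start.
        Self { votes: 1, quorum, last_sent: now, retry_after }
    }

    /// Record a vote received from a peer.
    fn add_vote(&mut self) {
        self.votes += 1;
    }

    /// True when the header is still short of a quorum and the timeout
    /// has elapsed since the last broadcast.
    fn should_retry(&self, now: Instant) -> bool {
        self.votes < self.quorum
            && now.saturating_duration_since(self.last_sent) >= self.retry_after
    }

    /// Note that the header was re-broadcast at `now`.
    fn mark_resent(&mut self, now: Instant) {
        self.last_sent = now;
    }
}
```

In the scenario from this issue, a healthy node would sit at 2 of 3 votes, `should_retry` would eventually return true, and the re-sent header would reach the restarted node even though the earlier copy had already been acked.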

Oct 28 '22 18:10 mwtian

Yeah, we actually talked about two different solutions for this with @laura-makdah. Maybe the three of us can have a quick catch-up?

Oct 28 '22 18:10 akichidis

Sounds good. Let me know anytime you catch up.

Oct 28 '22 18:10 mwtian

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.

Dec 28 '22 01:12 github-actions[bot]

[narwhal] recover from loss of liveness - last header missed
