[narwhal] reject header with rounds too advanced
When a narwhal node is falling behind, or catching up after restarting, it will receive headers whose rounds are much higher than its latest processed certificates. Trying to gather data and vote on these headers is not useful work: by the time this node has caught up on parent certificates, the vote on the header is likely no longer needed.
For liveness, I believe it should be the header-proposing node's responsibility to retry the header, or propose at a higher round, if it still needs votes.
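For concreteness, the gating being proposed could look roughly like the sketch below. The constant `MAX_HEADER_ROUND_AHEAD`, the function name, and the error handling are illustrative assumptions, not the actual narwhal code:

```rust
// Hypothetical sketch of the proposed round check; names are illustrative only.
const MAX_HEADER_ROUND_AHEAD: u64 = 100; // how far ahead of our latest certificate we tolerate

fn sanitize_header_round(
    header_round: u64,
    highest_processed_round: u64,
) -> Result<(), String> {
    if header_round > highest_processed_round + MAX_HEADER_ROUND_AHEAD {
        // Drop the header instead of parking it: by the time we have fetched all
        // parent certificates, our vote would most likely no longer be needed.
        return Err(format!(
            "header round {} is too far ahead of locally processed round {}",
            header_round, highest_processed_round
        ));
    }
    Ok(())
}
```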
Are we sure the header's author will retry transmitting its header if it needs more votes? Otherwise we may lose liveness (as you say).
I need to read the code more to make sure, but my mental model is that the following scenarios should be equivalent in their effect on liveness:
1. not voting on a header during a transient period
2. votes on a header being dropped by the network (transiently)

If (1) hurts liveness, the system is not going to be "live" after (2) either, i.e. the system will be permanently stuck after network issues. If the system indeed cannot handle temporary network problems, that is definitely a much higher-priority issue to fix than this one. I can look into retrying headers etc.
Ha! I see your point. So the idea is that `header.author` will continue to retry until it gets enough votes. As a result, if too many nodes are slow, `header.author` will halt its progress and simply wait for the slow nodes to catch up; only then will it have its header certified. Did I get this right?

If so, isn't this equivalent to the current system? We currently do vote for `header`, and `header.author` gets its new certificate quickly. It will however halt until it gets enough parents to move to the next round. Aren't we applying the same control system in a different way?
I took another look at the `Core::process_own_header()` logic. My understanding is that it broadcasts a header and collects votes in a best-effort manner. If the header cannot collect a quorum of votes, eventually the primary will receive certificates for later rounds and move on to the next round itself. I'm not sure liveness is ensured with the current logic: suppose a network issue drops all header broadcasts while all nodes are at the same round; maybe I'm missing something, but I don't see how the system would recover. This is why I'm proposing adding retries for header broadcast.

The discussion about liveness is independent of this PR though. IIUC this PR has no marginal effect on liveness, because it is equivalent to the network dropping votes.
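To make the described flow concrete, here is a minimal sketch of the two ways a primary can advance, using hypothetical names (`RoundAdvance`, `next_round_trigger`) rather than the real `Core` internals:

```rust
// Sketch only: a primary either certifies its own header with a quorum of votes,
// or advances when it sees a quorum of certificates at a later round.
enum RoundAdvance {
    OwnCertificate,        // collected a quorum of votes for our own header
    ParentsFromLaterRound, // saw a quorum of certificates for a higher round
}

fn next_round_trigger(
    votes_for_own_header: usize,
    quorum: usize,
    saw_quorum_of_later_certificates: bool,
) -> Option<RoundAdvance> {
    if votes_for_own_header >= quorum {
        Some(RoundAdvance::OwnCertificate)
    } else if saw_quorum_of_later_certificates {
        Some(RoundAdvance::ParentsFromLaterRound)
    } else {
        None // keep waiting; this is where liveness concerns arise if broadcasts are lost
    }
}
```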
@mwtian I see your point - and probably this triggers a broader discussion on the definition of "receipt" across the nodes (I'll come back to that later). A couple of points though regarding the above:
- On `Core::process_own_header()` we do indeed broadcast the `header` to the other nodes. However, this is a reliable broadcast, meaning that our node will keep retrying sending the header to the other nodes until they successfully receive it. Note that the definition of "success/receipt" here is purely acknowledgment-based; no actual processing takes place. So even in the case where a network issue drops all the header broadcasts, the proposer nodes will keep retrying until they succeed (see the sketch after this list).
- Regarding the votes, we use the same "reliable" mechanism there as well. When a validator sends their vote, they will keep retrying until it has succeeded (the `ReliableNetwork` is used).
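A minimal sketch of the retry-until-ack behaviour described in the first bullet, assuming a hypothetical `PeerChannel` trait in place of the real `ReliableNetwork`:

```rust
use std::time::Duration;

/// Hypothetical transport handle; the real code uses narwhal's `ReliableNetwork`.
trait PeerChannel {
    /// Returns Ok(()) once the peer has acknowledged *receipt* of the message.
    fn send(&self, payload: &[u8]) -> Result<(), std::io::Error>;
}

fn reliable_send<C: PeerChannel>(peer: &C, payload: &[u8], retry_delay: Duration) {
    loop {
        match peer.send(payload) {
            // An ack only means "received", not "processed": a peer can ack a
            // header, crash, and never vote on it (the gap discussed below).
            Ok(()) => break,
            // On any error (including timeouts), back off and try again.
            Err(_) => std::thread::sleep(retry_delay),
        }
    }
}
```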
So when it comes to network issues, the current code provides enough guarantees. Now, coming back to the definition of "receipt": we have no guarantee that a node that receives a valid header or vote will eventually process it. For instance, if a node received a `header`, then crashed, and then was brought back up, it will not "resume" processing of the header. However, in order for this to lead to a loss of liveness we need at least `f + 1` failures of this fashion (ack the receipt of a header and never vote, which I believe is roughly equivalent to having `f + 1` byzantine nodes). Ideally, to solve this, as you said we can either:
(1) let the proposer resend the header (as you suggested) when, for example, no votes have been received for the past X seconds (and no certificates of a higher round have been received by then), or
(2) have a bootstrapping node request all the latest proposed headers from its peers so it can vote if needed.
However, the above might be a somewhat different/more general issue. Back to the scope of this PR: as @asonnino said, we might kill liveness, as we currently don't have a mechanism to re-broadcast a header whose votes are heavily delayed (again, we do have retries for network issues). On another note, what's the worst that can happen if we just "accept" those high-round headers for processing? We'll park them anyway until we have completed their causal history and then restore them for voting. Sure, we'll use more memory, but is there anything more?
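A minimal sketch of option (1), assuming hypothetical `ProposerState` fields and a `should_rebroadcast` helper; the actual proposer state and timers in narwhal would differ:

```rust
use std::time::{Duration, Instant};

// Illustrative state a proposer might track for its current header.
struct ProposerState {
    own_header_round: u64,
    votes_received: usize,
    highest_certificate_round: u64,
    last_broadcast: Instant,
}

fn should_rebroadcast(state: &ProposerState, quorum: usize, timeout: Duration) -> bool {
    // Re-send only if we still lack a quorum of votes, no higher-round
    // certificate has superseded our header, and the timeout has elapsed.
    state.votes_received < quorum
        && state.highest_certificate_round < state.own_header_round
        && state.last_broadcast.elapsed() >= timeout
}
```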
@akichidis I took a look at the code. Yes, you are right that `ReliableNetwork` retries on errors including timeouts, which I did not realize earlier. Indeed, acking a header does not yet guarantee it will ever be voted on, because of restarts.

And I still believe this change does not affect the marginal liveness of the system. Any time dropping a header might affect liveness, the same scenario could have happened with network issues or restarts.

The motivation for this change is that right now it can take a while for a narwhal node to catch up. During the catch-up period, if the node keeps the headers it receives, the headers are unlikely to get voted on in time, yet they still consume memory, non-trivial CPU, and significant `Core` loop time (for re-processing headers and voting). And there were deadlocks in `HeaderWaiter` due to too many headers (but this has been fixed). The observation is that dropping headers with rounds that are too high made the system catch up faster and in a more stable manner.
We should not gate on the header round, even if it takes a long time to catch up and vote for the header. Consider the case where we are in a halted state because too many nodes are down to form a quorum, and one node comes back up, restoring enough live nodes for a quorum: if that restored node is further behind the other nodes than the catch-up threshold defined here, the system will not resume liveness.
Thanks for taking a look. Could you elaborate on why the system will not resume liveness? After a restart, all suspended headers in the header waiter are lost anyway.

The existing logic related to proposing and voting for headers may indeed have liveness issues after a restart, which I or someone else can follow up on after @aschran's refactor converting proposing and voting to a request-response pattern.
The invariants-for-liveness document goes over the approach to maintaining liveness, which is a rather delicate three-part solution where each part depends on the others.

It looks like this PR is an optimization, but considering that we are in the process of testing the loss-of-liveness solution, it is not quite the time to optimize in a way that will change or complicate that solution.

As others have mentioned, there is no mechanism to re-broadcast a header on a node that is continuously running. If we were to implement this type of optimization, we would also need a task that continually re-broadcasts headers, but I don't think that is worth the trade-off, considering the scenario I mentioned in my earlier comment. In that specific case, waiting and catching up on all the missing state is inevitably needed to make progress.
Closing since there is https://github.com/MystenLabs/sui/pull/5990