zebra
zebra copied to clipboard
Revert note commitment and history trees when forking non-finalized chains
Motivation
When Zebra forks a non-finalized chain, it rebuilds the note commitment trees from the finalized tip. But this can execute 100 blocks worth of note commitment tree updates, which uses a lot of CPU.
We're seeing rebuild times of around 75 seconds, which is way too long:
2022-07-20T20:34:08.026902Z INFO {zebrad="b52b10c" net="Main"}:{peer=Out("3.72.134.66:8233")}:msg_as_req{msg="inv"}:inbound:download_and_verify{hash=00000000017131670f50301f97d80699baaef991ffed0c4ec9b6a0b0c189e587}:state: zebra_state::service::non_finalized_state::chain: finished rebuilding note commitment trees after a non-finalized chain fork rebuild_time=75.654621465s fork_height=Height(1744525) fork_tip=block::Hash("0000000000f4dd1105f94c06f28189ebdd60c0895d8931486847cce3278590e5")
Priority
We added a log in PR #4795, it will tell us how important this fix is.
Designs
Instead of rebuilding the trees from the root, we should remove the trees and anchors that are above the fork.
Refactors:
- store history trees by height
- replace tip sprout, sapling, orchard, history trees with methods that do a lookup using the tip height
Change:
- during a fork,
revert_chain_with()
sprout, sapling, orchard, history trees by removing anchors and trees above the fork height
Cleanup:
- replace
Chain::new()
withderive(Default)
- replace
Chain::with_trees()
withderive(Clone)
- cleanup unused arguments from
Chain::fork()
- cleanup any methods in
NonFinalizedState
that call these methods
Related Work
Note that it's non-trivial to remove nodes from a Frontier
(at least I couldn't figure it out quickly when I tried). But since we store the trees for each height now, we can just start from the fork point and add the new nodes.
We're seeing rebuild times of around 75 seconds, which is way too long:
2022-07-20T20:34:08.026902Z INFO {zebrad="b52b10c" net="Main"}:{peer=Out("3.72.134.66:8233")}:msg_as_req{msg="inv"}:inbound:download_and_verify{hash=00000000017131670f50301f97d80699baaef991ffed0c4ec9b6a0b0c189e587}:state: zebra_state::service::non_finalized_state::chain: finished rebuilding note commitment trees after a non-finalized chain fork rebuild_time=75.654621465s fork_height=Height(1744525) fork_tip=block::Hash("0000000000f4dd1105f94c06f28189ebdd60c0895d8931486847cce3278590e5")
So this is a high priority.
Note that it's non-trivial to remove nodes from a Frontier (at least I couldn't figure it out quickly when I tried).
I think Frontier
doesn't support that operation at all.
Zebra's performance is good enough for now.
We're not making changes of this size after tagging the release candidate.
This is mitigated by #5257, which runs these slow updates in a separate thread.
@mpguerra we've been told by a mining pool that fork speed is a significant issue for them.
I've updated this ticket based on the current code, we've done most of the work here already, due to other state tickets.
Thanks for the updates! This seems like a significant change so, in terms of scheduling, we'd probably want to evaluate how this would affect the stable release candidate series and the audit.
Thanks for the updates! This seems like a significant change so, in terms of scheduling, we'd probably want to evaluate how this would affect the stable release candidate series and the audit.
I think this change is a high priority for miners. They are unlikely to use Zebra unless it is fixed, because it prevents them mining 1-2 blocks after each chain fork.
Here is my evaluation of the complexity of this change:
This change is all contained within the chain.rs
module. It should change less than 100 lines of code. As part of this change, we can delete some complex tree update code that's no longer needed.
This will fix a current performance bug in the stable release candidate series. We almost fixed this bug before the audit, because the performance is much worse than zcashd
. But we ran out of time.
We already have unit and full sync tests for this code, so it is a low risk change.
I've added a note about the consensus rules to the ticket description. If we do the change the way I described it in the ticket, then keeping the consensus rules actually becomes a lot simpler.
So I can't see anything complicated about this ticket. Maybe Deirdre or Conrado could double-check?
So I can't see anything complicated about this ticket. Maybe Deirdre or Conrado could double-check?
@conradoplg do you mind double checking this?
I agree it's not complicated, the old code was basically a workaround, and we will able to replace it with much neater code. This part of the code is well tested so it is unlikely that a bug will be introduced. The hardest part will understanding the old code, but I can help if whoever implements this is not familiar with it.
I can also help with implementation or reviews.
moving this to Sprint 3, we should prioritise testing instead for this sprint