zebra icon indicating copy to clipboard operation
zebra copied to clipboard

Revert note commitment and history trees when forking non-finalized chains

Open teor2345 opened this issue 2 years ago • 6 comments

Motivation

When Zebra forks a non-finalized chain, it rebuilds the note commitment trees from the finalized tip. But this can execute 100 blocks worth of note commitment tree updates, which uses a lot of CPU.

We're seeing rebuild times of around 75 seconds, which is way too long:

2022-07-20T20:34:08.026902Z INFO {zebrad="b52b10c" net="Main"}:{peer=Out("3.72.134.66:8233")}:msg_as_req{msg="inv"}:inbound:download_and_verify{hash=00000000017131670f50301f97d80699baaef991ffed0c4ec9b6a0b0c189e587}:state: zebra_state::service::non_finalized_state::chain: finished rebuilding note commitment trees after a non-finalized chain fork rebuild_time=75.654621465s fork_height=Height(1744525) fork_tip=block::Hash("0000000000f4dd1105f94c06f28189ebdd60c0895d8931486847cce3278590e5")

Priority

We added a log in PR #4795, it will tell us how important this fix is.

Designs

Instead of rebuilding the trees from the root, we should remove the trees and anchors that are above the fork.

Refactors:

  • store history trees by height
  • replace tip sprout, sapling, orchard, history trees with methods that do a lookup using the tip height

Change:

  • during a fork, revert_chain_with() sprout, sapling, orchard, history trees by removing anchors and trees above the fork height

Cleanup:

  • replace Chain::new() with derive(Default)
  • replace Chain::with_trees() with derive(Clone)
  • cleanup unused arguments from Chain::fork()
  • cleanup any methods in NonFinalizedState that call these methods

Related Work

teor2345 avatar Jul 19 '22 21:07 teor2345

Note that it's non-trivial to remove nodes from a Frontier (at least I couldn't figure it out quickly when I tried). But since we store the trees for each height now, we can just start from the fork point and add the new nodes.

conradoplg avatar Jul 20 '22 14:07 conradoplg

We're seeing rebuild times of around 75 seconds, which is way too long:

2022-07-20T20:34:08.026902Z INFO {zebrad="b52b10c" net="Main"}:{peer=Out("3.72.134.66:8233")}:msg_as_req{msg="inv"}:inbound:download_and_verify{hash=00000000017131670f50301f97d80699baaef991ffed0c4ec9b6a0b0c189e587}:state: zebra_state::service::non_finalized_state::chain: finished rebuilding note commitment trees after a non-finalized chain fork rebuild_time=75.654621465s fork_height=Height(1744525) fork_tip=block::Hash("0000000000f4dd1105f94c06f28189ebdd60c0895d8931486847cce3278590e5")

So this is a high priority.

teor2345 avatar Jul 20 '22 21:07 teor2345

Note that it's non-trivial to remove nodes from a Frontier (at least I couldn't figure it out quickly when I tried).

I think Frontier doesn't support that operation at all.

upbqdn avatar Jul 22 '22 14:07 upbqdn

Zebra's performance is good enough for now.

teor2345 avatar Aug 23 '22 00:08 teor2345

We're not making changes of this size after tagging the release candidate.

teor2345 avatar Oct 11 '22 00:10 teor2345

This is mitigated by #5257, which runs these slow updates in a separate thread.

teor2345 avatar Oct 12 '22 00:10 teor2345

@mpguerra we've been told by a mining pool that fork speed is a significant issue for them.

I've updated this ticket based on the current code, we've done most of the work here already, due to other state tickets.

teor2345 avatar Dec 13 '22 21:12 teor2345

Thanks for the updates! This seems like a significant change so, in terms of scheduling, we'd probably want to evaluate how this would affect the stable release candidate series and the audit.

mpguerra avatar Dec 14 '22 10:12 mpguerra

Thanks for the updates! This seems like a significant change so, in terms of scheduling, we'd probably want to evaluate how this would affect the stable release candidate series and the audit.

I think this change is a high priority for miners. They are unlikely to use Zebra unless it is fixed, because it prevents them mining 1-2 blocks after each chain fork.

Here is my evaluation of the complexity of this change:

This change is all contained within the chain.rs module. It should change less than 100 lines of code. As part of this change, we can delete some complex tree update code that's no longer needed.

This will fix a current performance bug in the stable release candidate series. We almost fixed this bug before the audit, because the performance is much worse than zcashd. But we ran out of time.

We already have unit and full sync tests for this code, so it is a low risk change.

teor2345 avatar Dec 14 '22 20:12 teor2345

I've added a note about the consensus rules to the ticket description. If we do the change the way I described it in the ticket, then keeping the consensus rules actually becomes a lot simpler.

So I can't see anything complicated about this ticket. Maybe Deirdre or Conrado could double-check?

teor2345 avatar Dec 14 '22 20:12 teor2345

So I can't see anything complicated about this ticket. Maybe Deirdre or Conrado could double-check?

@conradoplg do you mind double checking this?

mpguerra avatar Dec 15 '22 09:12 mpguerra

I agree it's not complicated, the old code was basically a workaround, and we will able to replace it with much neater code. This part of the code is well tested so it is unlikely that a bug will be introduced. The hardest part will understanding the old code, but I can help if whoever implements this is not familiar with it.

conradoplg avatar Dec 15 '22 18:12 conradoplg

I can also help with implementation or reviews.

teor2345 avatar Dec 15 '22 21:12 teor2345

moving this to Sprint 3, we should prioritise testing instead for this sprint

mpguerra avatar Jan 17 '23 13:01 mpguerra