Zebra falsely activates the mempool when the connection goes down during a sync

Open upbqdn opened this issue 2 years ago • 9 comments

Motivation

Zebra currently estimates that it is close to the tip if the average number of synced blocks in the last four batches drops below 20. The decision is made in the function is_close_to_tip: https://github.com/ZcashFoundation/zebra/blob/update-column-family-names/zebrad/src/components/sync/status.rs#L67. Zebra uses the output of this function to decide when to activate the mempool.

However, when Zebra loses its peers during the syncing process, is_close_to_tip returns a false positive because the number of synced blocks in the last four batches naturally drops below 20. This happens, for example, when the internet connection goes down on the machine where Zebra is running.
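To make the false positive concrete, here is a minimal sketch of the heuristic described above. It is not the code from status.rs: the type RecentSyncLengths, its methods, and the constant names are hypothetical, and only the "last four batches" and "below 20" thresholds come from this issue.

```rust
use std::collections::VecDeque;

/// Number of recent batches the heuristic averages over (from the issue text).
const RECENT_BATCHES: usize = 4;
/// Average batch size below which the node counts as "close to the tip" (from the issue text).
const MIN_AVG_BATCH: usize = 20;

/// Hypothetical stand-in for the sync status: it only remembers recent batch sizes.
struct RecentSyncLengths {
    lengths: VecDeque<usize>,
}

impl RecentSyncLengths {
    fn new() -> Self {
        Self { lengths: VecDeque::new() }
    }

    /// Record how many blocks the latest batch synced, keeping only the newest entries.
    fn push(&mut self, synced_blocks: usize) {
        self.lengths.push_back(synced_blocks);
        if self.lengths.len() > RECENT_BATCHES {
            self.lengths.pop_front();
        }
    }

    /// True when the average recent batch size is below the threshold.
    fn is_close_to_tip(&self) -> bool {
        if self.lengths.is_empty() {
            return false;
        }
        let sum: usize = self.lengths.iter().sum();
        sum / self.lengths.len() < MIN_AVG_BATCH
    }
}

fn main() {
    let mut status = RecentSyncLengths::new();

    // Healthy sync far from the tip: large batches, so the check is false.
    for synced in [500, 480, 510, 495] {
        status.push(synced);
    }
    println!("mid-sync: {}", status.is_close_to_tip());

    // The connection drops: every batch syncs zero blocks.
    for _ in 0..RECENT_BATCHES {
        status.push(0);
    }
    // The average is now 0, so the check is true even though the node is far behind.
    println!("after losing peers: {}", status.is_close_to_tip());
}
```

The heuristic cannot tell "no more blocks to fetch" apart from "nobody to fetch blocks from", because both produce empty batches.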

Solution

Make sure Zebra has enough connections before is_close_to_tip returns the final decision.

Steps
  • Move WatchReceiver to zebra-chain so that you can use it from both zebra-state and zebra-network
  • Create a new channel in https://github.com/zcashfoundation/zebra/blob/75a679792bb787b95b4e9ce87aaefe205c48c97a/zebra-network/src/peer_set/initialize.rs#L159, store the sender in PeerSet, and return the receiver.
  • Use the sender in update_metrics in PeerSet.
  • Store the receiver in SyncStatus as WatchReceiver, and use it in is_close_to_tip.
  • Test that Zebra does not activate the mempool when it has no peers.
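The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not Zebra's implementation: a shared Arc<AtomicUsize> stands in for the watch channel and WatchReceiver so the example needs only the standard library, the update_metrics signature and struct fields are hypothetical, and the minimum peer count of 1 is an assumed value.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Average batch size below which the node counts as "close to the tip".
const MIN_AVG_BATCH: usize = 20;
/// Assumed minimum number of connected peers before the estimate is trusted.
const MIN_PEERS: usize = 1;

/// Hypothetical stand-in for PeerSet: it owns the "sender" side of the peer count.
struct PeerSet {
    peer_count: Arc<AtomicUsize>,
}

impl PeerSet {
    /// Create the peer set and return the "receiver" side for SyncStatus.
    fn new() -> (Self, Arc<AtomicUsize>) {
        let peer_count = Arc::new(AtomicUsize::new(0));
        let receiver = Arc::clone(&peer_count);
        (Self { peer_count }, receiver)
    }

    /// Publish the current number of ready peers (hypothetical signature).
    fn update_metrics(&self, ready_peers: usize) {
        self.peer_count.store(ready_peers, Ordering::SeqCst);
    }
}

/// Hypothetical stand-in for SyncStatus: recent batch sizes plus the peer count.
struct SyncStatus {
    recent_lengths: Vec<usize>,
    peers: Arc<AtomicUsize>,
}

impl SyncStatus {
    fn is_close_to_tip(&self) -> bool {
        // Without enough peers, small batches say nothing about the tip.
        if self.peers.load(Ordering::SeqCst) < MIN_PEERS {
            return false;
        }
        if self.recent_lengths.is_empty() {
            return false;
        }
        let sum: usize = self.recent_lengths.iter().sum();
        sum / self.recent_lengths.len() < MIN_AVG_BATCH
    }
}

fn main() {
    let (peer_set, receiver) = PeerSet::new();
    // Four empty batches, as seen both near the tip and after a lost connection.
    let status = SyncStatus { recent_lengths: vec![0, 0, 0, 0], peers: receiver };

    // No peers: the mempool should stay inactive.
    println!("no peers: {}", status.is_close_to_tip());

    // One connected peer: the same batch sizes now count as "close to the tip".
    peer_set.update_metrics(1);
    println!("one peer: {}", status.is_close_to_tip());
}
```

The last step in the list then amounts to asserting that is_close_to_tip stays false while the published peer count is zero.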

upbqdn avatar Jun 20 '22 15:06 upbqdn

> Make sure Zebra has enough connections before is_close_to_tip returns the final decision.

What happens if we're using testnet in a box, or we only have a small number of working peers, but we are actually near the tip?

Here are some possible solutions:

  • use a different minimum for testnet, because it usually has only 5-8 peers in total
  • check that we're near the previous maximum number of peers (but peers can go down)
  • check that we're near the estimated tip, based on our local clock (but clocks can be wrong)

teor2345 avatar Jun 20 '22 23:06 teor2345

> What happens if we're using testnet in a box, or we only have a small number of working peers, but we are actually near the tip?

I think checking that Zebra has at least one or two peers before returning true from is_close_to_tip would be sufficient.

upbqdn avatar Jun 21 '22 15:06 upbqdn

> > What happens if we're using testnet in a box, or we only have a small number of working peers, but we are actually near the tip?
>
> I think checking that Zebra has at least one or two peers before returning true from is_close_to_tip would be sufficient.

Ok, sounds good, I would suggest just one peer (that's a common "testnet in a box" setup).

teor2345 avatar Jun 21 '22 23:06 teor2345

We're making network issues a lower priority for now.

teor2345 avatar Aug 14 '22 21:08 teor2345

This might add unwanted load to other nodes, so it's worth fixing.

teor2345 avatar Aug 27 '22 04:08 teor2345

Marek and I had a chat about this ticket. It's not a release blocker, so we're going to update the ticket with a design and a draft branch, and move it out of this sprint.

We can fix it if it ever becomes a problem for users.

teor2345 avatar Sep 28 '22 23:09 teor2345

I updated the solution in the PR description with implementation details.

upbqdn avatar Oct 02 '22 23:10 upbqdn

We're not making changes this big while the audit is pending.

teor2345 avatar Oct 12 '22 00:10 teor2345

This is actually happening and causing CI failures:

assertion failed: output.stdout_line_contains("activating mempool").is_err() https://github.com/ZcashFoundation/zebra/actions/runs/3237875279/jobs/5305448541#step:15:520

teor2345 avatar Oct 12 '22 22:10 teor2345

This might have caused PR #5596 to fail in the merge queue:

Message: assertion failed: output.stdout_line_contains("activating mempool").is_err()

https://github.com/ZcashFoundation/zebra/actions/runs/3430430965/jobs/5717404168#step:3:226

teor2345 avatar Nov 09 '22 19:11 teor2345

This doesn't seem to be causing any issues for users?

teor2345 avatar Feb 01 '23 22:02 teor2345

Can re-open if needed in future

mpguerra avatar Feb 02 '23 16:02 mpguerra