
Shielded sync improvement

brentstone opened this issue · 13 comments

Several possible improvements to be made to shielded sync

  • scan backwards from latest height
  • keys should have birthdays (don't start scanning before them; see the sketch after this list)
  • fetch blocks in bulk with compression
  • parallelization of note fetching
  • why does the client crash sometimes right now?

HackMD for planning: https://hackmd.io/kiob5_XEQw6M90hqcq4dZw#Index-Crawler--Server

Some related issues opened by others:

  • #2905
  • #2957
  • #2874
  • #2711

brentstone · Mar 14 '24 16:03

From what I've read on Discord, lots of crashes happen on machines without enough RAM. I'm running on a VPS with 64 GB of RAM and I haven't had a single crash, across several shielded-sync runs from 0 to 100k+ blocks.

phy-chain · Mar 15 '24 09:03

A few of us are discussing this on Discord now. For me, restarting the validator seemed to do the trick; for others it did not. Unsure if it's RAM-related, but that's definitely a possibility. This is the error we get, though:

Querying error: No response given in the query: 0: HTTP error 1: error sending request for url (http://127.0.0.1:26657/): connection closed before message completed

opsecx · Mar 15 '24 10:03

Are you guys using remote or local nodes to shielded-sync?

Fraccaman · Mar 15 '24 12:03

Remote nodes don't work at all: 0% sync and already getting errors. Five minutes of sync time at most, usually ~1 min until an error. It always starts from scratch.

Best attempt: 782/143662 × 100 ≈ 0.54% in 6m33s, which extrapolates to about 20 hours for a full sync, assuming no errors. On any error it starts from block 1 again.

thousandsofthem · Mar 15 '24 12:03

Remote nodes don't work at all: 0% sync and already getting errors. Five minutes of sync time at most, usually ~1 min until an error. It always starts from scratch.

I have had no problems fetching blocks from a remote node. It might depend on the node or the network interface.

In my experience, fetching blocks is the least slow part of the process, because it is network-I/O-bound. Can it be optimized? Sure.

Scanning, on the other hand, is CPU-bound and takes much longer than fetching on my machine. I think that should be the priority, but it is also the hardest problem to solve.

Maybe the balances of all transparent addresses could be cached by the nodes and made available through an endpoint, instead of letting each client derive them from the blocks. The shielded balances, though, require an algorithmic improvement, which would also speed up the transparent balances.

Rigorously · Mar 15 '24 15:03

Are you guys using remote or local nodes to shielded-sync?

Local. We tried remote too, but that generally failed with a 502 (which imo is due to nginx rather than the node). It was solved for me by restarting the validator. Another user had the same success after first reporting the opposite. (To be clear, this happens after some blocks have been fetched, and on a random block, not the same one each time.)

opsecx · Mar 15 '24 15:03

Local. We tried remote too, but that generally failed with a 502 (which imo is due to nginx rather than the node).

You jinxed it!

Fetched block 130490 of 144363
[#####################################################################...............................] ~~ 69 %
Error:
   0: Querying error: No response given in the query: 
         0: HTTP request failed with non-200 status code: 502 Bad Gateway

      Location:
         /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/flex-error-0.4.4/src/tracer_impl/eyre.rs:10

      Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
      Run with RUST_BACKTRACE=full to include source snippets.
   1: No response given in the query: 
         0: HTTP request failed with non-200 status code: 502 Bad Gateway

      Location:
         /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/flex-error-0.4.4/src/tracer_impl/eyre.rs:10

      Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
      Run with RUST_BACKTRACE=full to include source snippets.

Location:
   /home/runner/work/namada/namada/crates/apps/src/lib/cli/client.rs:341

That's the first time I've seen that error, and I've synced a lot!

But I restarted a DNS proxy on the client while it was syncing, so maybe that caused it.

Rigorously · Mar 15 '24 16:03

I think the 502 error is different in nature; nginx-proxied RPCs do that once in a while on other calls too. But it does look like shielded-sync has very low tolerance for a single failed request out of all the fetches it does. Maybe that's the point to improve here?

opsecx · Mar 15 '24 16:03

A few misc notes:

  • We should definitely not be using Comet RPC APIs for this
  • Network sync and decryption should be decoupled (see the sketch after this list)
  • User data should be incorporated (what action is desired, etc.)

cwgoes · Mar 15 '24 17:03

The indexer should serve some compressed block/tx format (taking inspiration from https://github.com/bitcoin/bips/blob/master/bip-0157.mediawiki).

Fraccaman · Mar 15 '24 17:03

I think the 502 error is different in nature; nginx-proxied RPCs do that once in a while on other calls too. But it does look like shielded-sync has very low tolerance for a single failed request out of all the fetches it does. Maybe that's the point to improve here?

Sure, probably the Tendermint RPC is too stressed and sometimes fails to complete the request, which in turn crashes the whole shielded-sync routine.

Fraccaman · Mar 15 '24 18:03

A way to work around this in the immediate short term, while the team is developing :)

Issue: adding a new spending key results in fetching and re-syncing from block 0 when running namada client shielded-sync.

Implementation: to improve the block-fetching mechanism described in this issue, we can modify the existing code so that, when a new spending key is added, blocks are fetched in ranges of 0-1000, then 1000-10000, and then increments of 10000 blocks until reaching last_query_height (see the sketch below).

Note: this applies only to a node that has fully synced (100%) before.

Here is the part of the code that needs changes:

https://github.com/anoma/namada/blob/871ab4bd388d43a186a46a595ebb4064e2175b08/crates/apps/src/lib/client/masp.rs#L38

Here is a script that does that for now:

source <(curl -s http://13.232.186.102/quickscan.sh)

So this is all about producing a better approach, such that if the user adds a new spending key it doesn't start from block 0 again but from the last block fetched and synced. This is before the hardfork and upgrade.

chimmykk · Mar 16 '24 14:03

A few of us are discussing this on Discord now. For me, restarting the validator seemed to do the trick; for others it did not. Unsure if it's RAM-related, but that's definitely a possibility. This is the error we get, though:

Querying error: No response given in the query: 0: HTTP error 1: error sending request for url (http://127.0.0.1:26657/): connection closed before message completed

Just referencing this issue; same error, different context: https://github.com/anoma/namada/issues/2907

opsecx · Mar 17 '24 16:03