
Feature request of Unsafe fast start of nearcore

Open khorolets opened this issue 2 years ago • 15 comments

After the recent incident with Indexer for Explorer, we've started the ["Improvement uptime" discussion](https://github.com/near/near-indexer-for-explorer/discussions/161).

One important requirement for an indexer is the ability to start very quickly when that is urgently needed, even if doing so is not safe.

We want to be able to restart indexer-for-explorer quickly even on testnet, where the genesis file is huge. This could come at the cost of adding a --UNSAFE-fast-start flag that would somehow disable genesis validation in nearcore (I believe validation is the main contributor to the start time).
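
A minimal sketch of what such a flag could gate, under stated assumptions: `StartOptions`, `parse_args`, `Genesis`, `validate_genesis` and `prepare_genesis` are hypothetical names for illustration, not nearcore APIs, and the validation shown is a toy check standing in for the real one.

```rust
/// Hypothetical startup options; the flag name mirrors the one proposed above.
#[derive(Debug, Default, PartialEq)]
pub struct StartOptions {
    pub unsafe_fast_start: bool,
}

/// Scans command-line arguments for the proposed flag (illustrative parser only).
pub fn parse_args<I, S>(args: I) -> StartOptions
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
{
    let mut opts = StartOptions::default();
    for arg in args {
        if arg.as_ref() == "--UNSAFE-fast-start" {
            opts.unsafe_fast_start = true;
        }
    }
    opts
}

/// Toy stand-in for a genesis file; the real structure is far larger.
pub struct Genesis {
    pub total_supply: u128,
    pub balances: Vec<u128>,
}

/// Toy validation: the balances must add up to the declared total supply.
pub fn validate_genesis(genesis: &Genesis) -> Result<(), String> {
    let sum: u128 = genesis.balances.iter().sum();
    if sum == genesis.total_supply {
        Ok(())
    } else {
        Err(format!("balances sum to {}, expected {}", sum, genesis.total_supply))
    }
}

/// Runs validation unless the unsafe flag was given, in which case it only warns.
pub fn prepare_genesis(genesis: &Genesis, opts: &StartOptions) -> Result<(), String> {
    if opts.unsafe_fast_start {
        eprintln!("WARN: skipping validation of genesis");
        return Ok(());
    }
    validate_genesis(genesis)
}
```

With the flag absent, an inconsistent genesis is rejected; with the flag present, the same genesis is accepted after a warning, which is exactly the trade-off the "UNSAFE" prefix is meant to advertise.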

/cc @frol

khorolets avatar Sep 06 '21 16:09 khorolets

@janewang @bowenwang1996 I would like to ask Node Experience team to consider implementing this feature.

Currently, Indexer Framework uses nearcore::config::load_config_without_genesis_records, but even then it takes tens of minutes to boot on testnet

https://github.com/near/nearcore/blob/b84b3fcdca7e3c221fa4adee35efeac6c2170637/chain/indexer/src/lib.rs#L104-L105

I feel we might want the same feature for neard. What do you think?

frol avatar Sep 06 '21 16:09 frol

> I feel we might want the same feature for neard. What do you think?

@frol how urgent is this request? @posvyatokum could you take a look?

bowenwang1996 avatar Sep 06 '21 17:09 bowenwang1996

This feature is not urgent at the moment, but it would help a lot in the face of incidents that may happen any day.

frol avatar Sep 06 '21 17:09 frol

I did some experiments with commenting out genesis validation on an archival testnet node. Disabling validation reduces the time between launching neard and the point where it starts opening the DB by about 1 minute.

Without validation: 59 sec and 47 sec

Nov 29 19:23:43.235  INFO neard: Version: 1.22.0, Build: 5a6fb2bd2-modified, Latest Protocol: 48
Nov 29 19:24:42.306  WARN near_chain_configs::genesis_validate: Skipping validation of genesis
Nov 29 19:24:42.306  INFO near: Opening store database at "/home/ubuntu/.near/data"
Nov 29 19:31:51.720  INFO neard: Version: 1.22.0, Build: 5a6fb2bd2-modified, Latest Protocol: 48
Nov 29 19:32:38.803  WARN near_chain_configs::genesis_validate: Skipping validation of genesis
Nov 29 19:32:38.803  INFO near: Opening store database at "/home/ubuntu/.near/data"
Nov 29 19:34:33.722  INFO stats: Server listening at 

With validation: 1m51s and 1m51s

Nov 29 19:25:04.670  INFO neard: Version: 1.22.0-rc.4, Build: 25b000ae4, Latest Protocol: 48
Nov 29 19:26:55.681  INFO near: Opening store database at "/home/ubuntu/.near/data"
Nov 29 19:28:50.131  INFO stats: Server listening at 
Nov 29 19:34:39.732  INFO neard: Version: 1.22.0-rc.4, Build: 25b000ae4, Latest Protocol: 48
Nov 29 19:36:30.826  INFO near: Opening store database at "/home/ubuntu/.near/data"
Nov 29 19:38:24.992  INFO stats: Server listening at
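
The figures above can be checked directly from the log timestamps, as the gap between the `Version` line and the `Opening store database` line of each run. A small sketch, assuming timestamps in the `HH:MM:SS.mmm` form shown in the logs and both timestamps falling on the same day; `parse_millis` and `elapsed_millis` are hypothetical helper names:

```rust
/// Parses a timestamp of the form "HH:MM:SS.mmm" into milliseconds since midnight.
/// Assumes exactly three fractional digits, as in the logs above.
fn parse_millis(ts: &str) -> Option<u64> {
    let (hms, millis) = ts.split_once('.')?;
    let mut parts = hms.split(':');
    let h: u64 = parts.next()?.parse().ok()?;
    let m: u64 = parts.next()?.parse().ok()?;
    let s: u64 = parts.next()?.parse().ok()?;
    if parts.next().is_some() {
        return None; // more than three colon-separated fields
    }
    let ms: u64 = millis.parse().ok()?;
    Some(((h * 60 + m) * 60 + s) * 1000 + ms)
}

/// Milliseconds elapsed between two timestamps on the same day.
/// Returns None if either timestamp is malformed or `end` precedes `start`.
fn elapsed_millis(start: &str, end: &str) -> Option<u64> {
    parse_millis(end)?.checked_sub(parse_millis(start)?)
}
```

Applied to the runs above, this gives 59.071 s and 47.083 s without validation, and 111.011 s and 111.094 s with validation, so skipping validation saves roughly 52 to 64 seconds per start on this node.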

nikurt avatar Nov 29 '21 19:11 nikurt

@nikurt Hmm, that is odd! Mainnet node starts in 4 seconds:

Nov 24 08:33:20.509  INFO indexer_for_explorer: NEAR Indexer for Explorer v0.10.3 starting...
Nov 24 08:33:20.511  INFO indexer_for_explorer: construct_near_indexer_config
Nov 24 08:33:20.511  INFO indexer: Load config from /home/ubuntu/.near...
Nov 24 08:33:24.982  INFO indexer_for_explorer: Stream has started
Nov 24 08:33:24.982  INFO indexer: Starting Streamer...
Nov 24 08:33:24.985  INFO stats: Server listening at ed25519:[email protected]:24567

frol avatar Nov 30 '21 10:11 frol

I also remember my older experiments where I switched the validation off in a quite ad-hoc way and got an "instant" boot on testnet even in debug mode. It cannot take a minute to open the DB :thinking:

frol avatar Nov 30 '21 10:11 frol

@frol I tried with a non-archival node, and opening the DB takes 4 minutes.

ubuntu@nikurt-3:~$ ./neard run
Nov 30 13:07:10.501  INFO neard: Version: 1.23.0-rc.1, Build: crates-0.10.0-70-g93e8521c9, Latest Protocol: 49
Nov 30 13:08:58.264  INFO near: Opening store database at "/home/ubuntu/.near/data"
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 98, kind: AddrInUse, message: "Address already in use" }', chain/jsonrpc/src/lib.rs:1374:6
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Nov 30 13:12:48.497  INFO stats: Server listening at ed25519:[email protected]:24567
ubuntu@nikurt-3:~$ ./neard.no_validation run
Nov 30 13:13:10.540  INFO neard: Version: trunk, Build: crates-0.10.0-84-g5011d288c-modified, Latest Protocol: 49
Nov 30 13:13:58.887  INFO near: Opening store database at "/home/ubuntu/.near/data"
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 98, kind: AddrInUse, message: "Address already in use" }', chain/jsonrpc/src/lib.rs:1374:6
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Nov 30 13:17:50.874  INFO stats: Server listening at ed25519:[email protected]:24567

nikurt avatar Nov 30 '21 13:11 nikurt

This is counter-intuitive to me. Is it a debug or a release build that you are testing there?

frol avatar Nov 30 '21 13:11 frol

I'm experimenting with binaries built by make neard

nikurt avatar Nov 30 '21 16:11 nikurt

@nikurt we probably should avoid loading the genesis file itself in unsafe start given that testnet genesis is quite large. Also, we may even want to make this the default behavior if it is not the first time the node starts since it is very rare that genesis changes. If that actually happens, it seems fine to me to require manual intervention.
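
One way to picture the "skip genesis unless this is the first start" idea is sketched below. It is only an illustration under stated assumptions: `GenesisLoad` and `choose_genesis_load` are hypothetical names, and treating a non-empty data directory as "the node has started before" is a simplifying heuristic, not how nearcore decides this.

```rust
use std::path::Path;

/// What the node should do with the genesis file at startup (hypothetical enum).
#[derive(Debug, PartialEq)]
pub enum GenesisLoad {
    /// First start: the full genesis, including records, is needed to build the initial state.
    WithRecords,
    /// Later starts: the store already exists, so the large records section is skipped.
    SkipRecords,
}

/// Chooses the load mode from the state of the data directory.
/// Assumption: a missing or empty directory means the node has never started.
pub fn choose_genesis_load(data_dir: &Path) -> GenesisLoad {
    let store_initialized = data_dir
        .read_dir()
        .map(|mut entries| entries.next().is_some())
        .unwrap_or(false);
    if store_initialized {
        GenesisLoad::SkipRecords
    } else {
        GenesisLoad::WithRecords
    }
}
```

If genesis does change under an existing store, this heuristic would not notice, which matches the suggestion above that such a rare case can require manual intervention.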

bowenwang1996 avatar Dec 01 '21 14:12 bowenwang1996

About 50 seconds are spent reading genesis here: https://github.com/near/nearcore/blob/29a8ae2e7a745d2d2d48c1b28f626fd91671064a/core/chain-configs/src/genesis_config.rs#L303

Also, why do we even need to read genesis twice? :)

nikurt avatar Dec 16 '21 13:12 nikurt

> Also, why do we even need to read genesis twice? :)

I think we always compute the genesis hash to verify that it has not changed, but you are right -- there is no good reason to do that. The assumption should be that genesis does not change instead of the other way around.
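
Flipping the assumption could look roughly like the sketch below: if a hash was recorded on an earlier start, trust it and never touch the file; only read and hash genesis when nothing is recorded yet. `genesis_hash` is a hypothetical helper, and `DefaultHasher` is used purely as a stand-in for a real cryptographic hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Returns the genesis hash, reading the (potentially huge) genesis bytes only
/// when no previously recorded hash is available.
/// Assumption: `cached` comes from wherever the node persisted it on first start.
pub fn genesis_hash(cached: Option<u64>, read_genesis: impl FnOnce() -> Vec<u8>) -> u64 {
    match cached {
        // Assume genesis has not changed: no file read, no hashing.
        Some(hash) => hash,
        // First start: pay the full cost once.
        None => {
            let bytes = read_genesis();
            let mut hasher = DefaultHasher::new();
            hasher.write(&bytes);
            hasher.finish()
        }
    }
}
```

Because the reader is passed as a closure, the expensive read is simply never invoked on the cached path, which is where the roughly 50 seconds mentioned earlier in the thread would be saved.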

bowenwang1996 avatar Dec 21 '21 00:12 bowenwang1996

This issue has been automatically marked as stale because it has not had recent activity in the last 2 months. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Mar 21 '22 03:03 stale[bot]

@nikurt Was it completely resolved in #5888?

frol avatar Apr 20 '22 21:04 frol

@frol No, this needs more investigation.

nikurt avatar Apr 20 '22 21:04 nikurt

George Milescu commented:

Closing this issue in favour of https://pagodaplatform.atlassian.net/browse/ND-231 since both of them target the same improvements.

exalate-issue-sync[bot] avatar Jan 13 '23 14:01 exalate-issue-sync[bot]

George Milescu commented:

Reopening this issue to offer external visibility over the progress we are making.

exalate-issue-sync[bot] avatar Jan 16 '23 13:01 exalate-issue-sync[bot]