🔷 [Tracking issue] Resharding v2
Goals
Background
The goal of the resharding project is to implement fast and robust resharding.
Why should NEAR One work on this
Currently, NEAR protocol has four shards. With more partners onboarding, we started seeing that some shards occasionally become overcrowded with respect to total state size and number of transactions. In addition, with state sync and stateless validation, validators will not need to track all shards, and validator hardware requirements can be greatly reduced with smaller shard sizes. With future in-memory tries, it's also important to limit the size of individual shards.
What needs to be accomplished
The implementation should be robust enough so that we can later use it in Phase 2. The implementation should also allow for shard deletion in the future - meaning that any changes to the trie and the storage should support fast deletion.
Main use case
Once the project is completed we should be able to manually schedule a resharding of the largest shards in mainnet and testnet and the resharding should smoothly take place without any disruptions to the network.
Links to external documentations and discussions
Assumptions
- Flat storage is enabled.
- Shard split boundary is predetermined and hardcoded. In other words, the necessity of shard splitting is decided manually (see the sketch after this list).
- For the time being, resharding as an event is only going to happen once, but we would still like to have the infrastructure in place to handle future resharding events with ease.
- Merkle Patricia Trie is the underlying data structure for the protocol state.
- An epoch is at least 6 hours long, which gives resharding time to complete.
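Loosely, the hardcoded boundary accounts partition the account-id space lexicographically: an account belongs to the shard between the two boundaries it falls into. A minimal Rust sketch of that rule (illustrative only, not the actual nearcore API; the example boundaries are placeholders):

```rust
/// Maps an account id to a shard index given the sorted, hardcoded boundary
/// accounts: accounts ordering below a boundary go to the shard on its left;
/// the boundary account itself and everything above go to the right.
/// Illustrative sketch only, not the actual nearcore API.
fn account_to_shard(account_id: &str, boundary_accounts: &[&str]) -> usize {
    // Binary search: count the boundaries that are <= the account id.
    boundary_accounts.partition_point(|boundary| *boundary <= account_id)
}

fn main() {
    // Example boundaries producing 4 shards (purely illustrative).
    let boundaries = ["aurora", "aurora-0", "kkuuue2akv_1630967379.near"];
    assert_eq!(account_to_shard("alice.near", &boundaries), 0);
    assert_eq!(account_to_shard("aurora-0-wrap.near", &boundaries), 2);
    assert_eq!(account_to_shard("token.sweat", &boundaries), 3);
    println!("token.sweat -> shard {}", account_to_shard("token.sweat", &boundaries));
}
```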
Pre-requisites
- Resharding must be fast enough so that both state sync and resharding can happen within one epoch.
- Resharding should work efficiently within the limits of the current hardware requirements for nodes.
- Potential failures in resharding may require intervention from the node operator to recover.
- No transaction or receipt must be lost during resharding.
- Resharding must work regardless of the number of existing shards.
- No apps, tools or code should hardcode the number of shards to 4.
Out of scope
- Dynamic resharding
- automatically scheduling resharding based on shard usage/capacity
- automatically determining the shard layout
- Merging shards or boundary adjustments
- Shard reshuffling
- Shard Layout determination logic. Shard boundaries are still determined offline and hardcoded.
Task list
mainnet release preparation
- [x] support resharding on split storage archival nodes
- [x] fix the bug when opening the snapshot
- [x] add split storage support to mocknet (need help from the node team here)
- [ ] write a test for this case
- [x] support node restart
- [x] in the building state phase
- [x] in the catch up phase
- [x] in the post catch up phase
- [ ] add test coverage for all the above
mainnet release
- [ ] ensure lake nodes are upgraded
- [ ] pause backups to avoid unnecessary restarts
- [ ] check the health of state dumpers zulip
implementation
- [x] #9202
- [x] #9418
- #9420
- #9421
- @shreyan-gupta is the owner
- [x] Implement scheduling resharding logic - predefined hardcoded boundary
- #9274
- [x] #9417
- #9335
- [x] Use flat storage at correct head
- [x] check if can enable flat storage snapshot on archival nodes
- yes zulip
- [x] consider using an on-demand snapshot instead of the state sync snapshot
- We decided to re-use the existing state sync snapshot solution due to simplicity
- [x] enable snapshot on archival nodes either permanently or conditionally for resharding
- [x] #10149
- [x] delay resharding until state snapshot is ready
- [x] test it
- [x] ensure flat storage snapshot exists or handle it missing
- [x] Fix outgoing receipts reassignment at shard layout switch
- #9362
- [x] Fix incoming receipts reassignment at shard layout switch
- #9467
- #9474
- [x] Fix transaction pool losing transactions at shard layout switch
- #9494
- [x] Fix gas and balance reassignment at shard layout switch
- #9449
- [x] Implement* triggering resharding after catchup
- [x] Implement* applying blocks to both old shard and new shards
- [x] Implement* shard layout switch at epoch boundary
- [x] #10140
- @shreyan-gupta is the owner
- [x] Pick a new boundary account to split an existing shard
- #10094
- #10119
- #10197
- #10240
- `tge-lockup.sweat` is a good boundary from a storage point of view and splits the shard 36:30:32
- `token.sweat` is a good boundary from a gas usage point of view and splits the shard 27:45:27
- [x] Relevant security vulnerability https://github.com/near/nearcore-private/issues/67
- [x] Fix the vulnerability #10055
- [x] Benchmark and throttle resharding to not impact node operation
- [x] #9994
- [x] zulip performance investigation
- [x] #10192
- #10143
- [x] Testing and stabilization
- [x] Test resharding in mocknet or betanet
- [x] small network no traffic
- [x] small network with traffic
- [x] Test resharding on real rpc and archival nodes in mainnet and testnet
- [x] Extend existing resharding tests to cover the V1->V2 migration
- #9345
- #9373
- #9402
- #9467
- [x] test missing chunk in the first block of new shard layout zulip
- #10052
- [x] handling error and corner cases
- #10179
- [x] https://github.com/near/nearcore/issues/5047
- #10296
Operational
- [x] NEP
- [x] build monitoring and dashboard
- [x] Handle the increase in expected workload for validators
- [ ] Update public documentation on resharding: https://near.github.io/nearcore/architecture/how/resharding.html
- @wacban is the owner
- [x] update the docs
- [x] Once the dynamic config is merged, update the instructions in the throttling section.
- [ ] Once the NEP is merged, update the link to it - currently the PR is linked.
- [ ] Make the block production the same in testnet and mainnet
- @posvyatokum is the owner
- [x] Make throttling a dynamic config that can be updated without restarting the node
- #10296
- #10297
- [x] test in mocknet - visible speed up in the batch size graph - https://nearinc.grafana.net/goto/aXE8CeNIR?orgId=1
- [ ] Add support for pseudo archival nodes in mocknet and use it in pre-release testings
- @posvyatokum is the owner
- [x] Stress test resharding in mainnet and testnet
- mainnet rpc, mainnet archival, testnet rpc, testnet archival
- resharding takes about 3h in mainnet and 5h30m in testnet
- [x] Test failure recovery
- [x] start resharding in mocknet, crash a node in the middle of resharding, restart the node, make sure resharding still completes and all is fine - https://nearinc.grafana.net/goto/aAosJBvSR?orgId=1
- [ ] Notify developers of all tools, infra, etc., so they are aware and prepared to handle the new shard layout.
- [ ] Prepare a RPC endpoint for querying the shard layout and the number of shards
- [x] EXPERIMENTAL_protocol_config contains the shard layout (see the example request after this task list)
- [x] test that it works as expected
- [x] document how to use it
- [ ] create a dedicated endpoint just for querying the shard layout
- [ ] Notify any other tools, infra, etc., developers and SRE about resharding
- [x] slack general https://pagodaplatform.slack.com/archives/C01R19A6NHX/p1700517714470499
- [x] slack near-releases https://pagodaplatform.slack.com/archives/C02C1NLNVL4/p1701781284466569
- [x] slack sre-support
- [x] zulip general
- [ ] devhub
- [x] near social
- [x] telegram NEAR protocol community group
- [x] telegram NEAR Tools community group
- [x] telegram NEAR Dev group
- [x] Discord Dev-updates
- [x] explorer
- [x] databricks
- from slack "from the Databricks side we should be good if the S3 files continues with the same name pattern shard.json"
- [x] query api
- [x] https://github.com/near/queryapi/issues/382
- [x] Make resharding interruptible - currently SIGINT is not enough to stop the node in the middle of resharding
- #10328
- [x] Stabilize
- #10303
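For reference, the shard layout can already be queried from any RPC node with the EXPERIMENTAL_protocol_config method mentioned in the task list above. A minimal example request body, POSTed to the node's JSON-RPC endpoint (the exact response shape may vary between versions):

```json
{
  "jsonrpc": "2.0",
  "id": "dontcare",
  "method": "EXPERIMENTAL_protocol_config",
  "params": { "finality": "final" }
}
```

The response should include a `shard_layout` field with the boundary accounts and the shard layout version.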
Code Quality improvements
- [ ] Fix the shard id type inconsistency between ShardUId and ShardId (see the sketch after this list). A proper fix may require a db migration, but perhaps we can fix it at the db api level.
- [x] Rename from "state split" to "resharding".
- #10393
- rename the db column - see this comment
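For context, a simplified sketch of the two identifier types (the actual definitions live in nearcore's near-primitives; the field layout here follows them but is illustrative):

```rust
/// `ShardId` is the protocol-level shard index (simplified sketch; see the
/// nearcore source for the actual definitions).
pub type ShardId = u64;

/// `ShardUId` is the storage-level identifier used to prefix trie keys.
/// Because it embeds the shard layout version, every resharding assigns new
/// `ShardUId`s to all shards, even the ones that are not being split - see
/// the "localize resharding" item below.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct ShardUId {
    pub version: u32,  // shard layout version, bumped on every resharding
    pub shard_id: u32, // u32 here vs u64 for ShardId: the inconsistency above
}
```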
Delayed until after the first rollout
- [ ] localize resharding only to relevant shards and improve shard naming
- Change the way shards are identified so that when resharding we only need to touch the relevant shards.
- Today, because the ShardUId includes the shard layout version, we always need to "reshard" all shards even if only one is getting split.
- [ ] state sync and resharding integration
- resharding discussion, Madrid Offsite
- resharding and state sync issue, zulip
- [ ] fix resharding of incoming receipt proofs
- [ ] fix or pause state dumping during resharding + tests zulip
- [ ] stateless validation and resharding integration (zulip)
- [ ] in memory trie and resharding integration
- [ ] Add a provisional V3 resharding in nightly to have the resharding tests check the stable -> nightly transition.
- [ ] Set the trie_cache size based on accounts rather than shard uids so that we don't need to update it with every resharding code ref
~~Brainstorming~~ COMPLETED
- [x] Brainstorm ideas.
- [x] Evaluate the available solutions.
- [x] #9101
- The results are promising but the type boundary split is expected to be even faster.
- [x] #9105
- The results are promising but the type boundary split is expected to be even faster.
- [x] [P0]#9198
- seems possible but quite complex and bug prone
- can't directly delete shard
- best approach is to GC back to the flat storage head and delete the remaining data using flat storage
- [x] #9199
- This trie structure is feasible from the point of view of the current trie usage.
- There is a minor blocker but it can be resolved relatively easily.
- This solution should be evaluated once implemented to ensure that it doesn't introduce a performance regression.
- [x] [P0]#9200
- Would require a dedicated, separate solution to address deduplication for all trie data types.
- Would be nice for contracts with large data as it would put their data close together in the trie.
- [x] #9201
- no longer needed
- [x] [P2]#9203
- no longer needed
- [x] [P0]#9204
- Migration is not feasible on archival nodes.
- Archival nodes may be getting deprecated in favor of read-rpc, but it won't be soon.
- May still be possible by maintaining both implementations and keeping old data in the old format.
- [x] [P1]#9205
- Likely possible but need to check how long it would take on archival nodes.
- [ ] [P2]#9206
- no owner
- [ ] Investigate what's the best way to represent shards across resharding
- no owner
- Do we want to keep shard_uid or is there anything better?
- [x] Select the best solutions
- [x] Select the best solution for the trie structure and resharding.
- We chose to keep the existing trie structure and use flat storage for resharding (sketched below).
- [x] Select the best solution for the storage structure and deletion.
- We chose to keep the existing storage structure and use range deletion for deletion.
'*' - implement or verify existing implementation
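At a high level, the chosen approach builds the children's state by iterating the parent shard's flat storage (at the state snapshot) and routing each key-value pair to the matching child, committing in throttled batches (the batching and throttling work tracked above). A minimal sketch under assumed helper names, not the actual nearcore code:

```rust
use std::{thread, time::Duration};

/// Hypothetical sketch of the flat-storage-based split: stream the parent
/// shard's flat storage entries (from the state snapshot), route each entry
/// to a child shard by the boundary account, and pause between batches so
/// the node keeps serving traffic. All helper names here are illustrative.
fn split_shard(
    parent_entries: impl Iterator<Item = (Vec<u8>, Vec<u8>)>, // flat storage iterator
    account_of: impl Fn(&[u8]) -> String,                     // account id encoded in the trie key
    boundary_account: &str,
    batch_size: usize,
    batch_delay: Duration,
) -> (Vec<(Vec<u8>, Vec<u8>)>, Vec<(Vec<u8>, Vec<u8>)>) {
    let (mut left, mut right) = (Vec::new(), Vec::new());
    let mut in_batch = 0;
    for (key, value) in parent_entries {
        // Same rule as the split analysis further below: account < boundary
        // goes to the left child, account >= boundary to the right child.
        if account_of(&key).as_str() < boundary_account {
            left.push((key, value));
        } else {
            right.push((key, value));
        }
        in_batch += 1;
        if in_batch >= batch_size {
            // In the real implementation a batch would be committed to the
            // child tries here; the delay throttles resharding so it doesn't
            // impact normal node operation.
            in_batch = 0;
            thread::sleep(batch_delay);
        }
    }
    (left, right)
}

fn main() {
    let entries = vec![
        (b"alice.near".to_vec(), b"v1".to_vec()),
        (b"token.sweat".to_vec(), b"v2".to_vec()),
    ];
    let (left, right) = split_shard(
        entries.into_iter(),
        |key| String::from_utf8_lossy(key).into_owned(),
        "tge-lockup.sweat",
        1000,
        Duration::from_millis(100),
    );
    println!("left: {}, right: {}", left.len(), right.len());
}
```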
Sep 14th
- Overall on track
- [Completed] Unhappy path handling for incoming receipts and transaction pools
- [WIP] flat storage snapshot handling (slow start due to Shreyan’s OOO)
- [WIP] testing improvements for resharding
[Note] The tracking issue will stay open throughout the project.
Sep 27th
- Happy path MVP is code complete, with a couple of PRs waiting to be merged
- Currently the team is working on adding integration tests and setting up a mainnet-fork based mocknet test.
The upcoming October roadmap is as follows:
- Target end state: E2E implementation including unhappy paths (an MVP covering all known happy/unhappy paths) and a successful run of the happy path MVP
- Identify additional hardware requirement and communicate with validators
- Have a working demo ready for NEARcon
- TODO: Need to define what is to be demoed
- Finalize NEP draft and schedule for SME review: ref
- Out of scope: refinement, additional tests, release prep
Resharding v2 is near completion and planned to be released with 1.37.
A couple of things are open at the moment:
- Get SME approval from Robin and Protocol working group before mainnet launch
- Final check to make sure there are no open issues before the 1.37 branch cut
- Continuous mocknet testing to make sure resharding can be performed successfully
- Failure case test to confirm things can recover upon failure
- Communication with SRE team so they are aware of it
- Validator communication
December 18th status update
- NEP is approved
- All necessary code is merged and scheduled for release with 1.37
- The team will look into the feedback received during the NEP review about making sure indexer and RPC nodes don't suffer during resharding.
Jan 16th Status update
- Waiting for 1.37 testnet release for testnet shard split
- @pugachAG merged shard cache size fix for post resharding: #10409
- @pugachAG started investigating whether we actually need a 3GB shard cache for the sweat shard, still running experiments
- Better engineering task merged: #10393
- @wacban chased explorer, indexer, databricks and query api to make sure they don't hardcode the current shard layout
Feb 7th update
- :tada: Resharding took place in testnet after the release :tada:
- validators, rpc and legacy archival nodes went well
- dashboard and logs were good
- split storage archival nodes failed.
- Manual recovery was needed.
- We are fully recovered.
- Added todo items to the list to fix before the mainnet release.
- lake nodes were not updated in time and didn't even enter resharding.
- Manual recovery was required.
- We are fully recovered.
- Issues with node restart in the midst of resharding were discovered in mocknet testing.
- Added todo items to the list to fix before the mainnet release.
- minor improvements
- #10450
- #10541
Feb 26th update
- Issues discovered during resharding in TestNet are being fixed for the MainNet release.
- Working with Node team (Marcelo) to continue investigating the issue where epoch transition is suffering after resharding: link
- Both @wacban and @shreyan-gupta were out last week, but are now back to work.
- Node team is preparing pre-release testing for resharding: link
March 11th update
- The resharding plan has evolved and we will perform two shard splits in a row. The detailed timeline is as follows:
- [1.37] Mon 2024-03-11 18:00 UTC - 1.37 voting date on mainnet
- [1.38 RC] Mon 2024-03-11 19:00 UTC - 1.38.0-rc.1 release on testnet
- [1.37] Tue 2024-03-12 07:00 UTC - resharding starts on mainnet for shard 3
- [1.37] Tue 2024-03-12 23:00 UTC - start of first epoch with 5 shards on mainnet
- [1.38 RC] Tue 2024-03-12 18:00 UTC - 1.38.0-rc.1 voting on testnet
- [1.38 RC] Wed 2024-03-13 07:00 UTC - resharding starts on testnet for shard 2
- [1.38] Wed 2024-03-13 18:00 UTC - release of 1.38 on mainnet
- [1.38 RC] Wed 2024-03-13 23:00 UTC - start of first epoch with 6 shards on testnet
- [1.38] Mon 2024-03-18 18:00 UTC - 1.38 voting date on mainnet
- [1.38] Tue 2024-03-19 07:00 UTC - resharding starts on mainnet for shard 2
- [1.38] Tue 2024-03-19 23:00 UTC - start of first epoch with 6 shards on mainnet
- To support the initiative above, the necessary changes and tests will be made based on the plan below
- [COMPLETED] @wacban checked the split boundary of shard 2 and figured out how well it splits the shard.
- @wacban ran the analysis on a fully caught up node and the results are worse now, giving a 27%-73% split. It's still something, and it would separate earn.kaiching from game.hot.tg.
```
Shard: s2.v1
Gas usage: 5709251.48 TGas (54.5% of total)
Number of accounts: 355144
Biggest account: game.hot.tg
Biggest account gas: 2922949.00 TGas (51.2% of shard)
Optimal split:
  boundary_account: game.hot.tg
  gas(boundary_account): 2922949.00 TGas
  Gas distribution (left, boundary_acc, right): (27.8%, 51.2%, 21.0%)
  Left (account < boundary_account):
    Gas: 1588025.55 TGas (27.8% of shard)
    Accounts: 83877
    Top 3 accounts:
      #1: earn.kaiching
          Used gas: 983586.34 TGas (17.2% of shard)
      #2: claim.sweat
          Used gas: 68254.94 TGas (1.2% of shard)
      #3: d08385fb005946d8ee761142123a3dc7610d46787283897c4fcd2584041584ae
          Used gas: 24306.55 TGas (0.4% of shard)
  Right (account >= boundary_account):
    Gas: 4121225.93 TGas (72.2% of shard)
    Accounts: 271267
    Top 3 accounts:
      #1: game.hot.tg
          Used gas: 2922949.00 TGas (51.2% of shard)
      #2: here.tg
          Used gas: 119488.37 TGas (2.1% of shard)
      #3: hotwallet.kaiching
          Used gas: 49449.18 TGas (0.9% of shard)
```
- @wacban - add a new integration test for the always congested network
- @shreyan-gupta - add a new set of tests for the V2->V3 resharding
- @wacban will prepare the changes needed to trigger the shard split on TestNet (target: March 8th morning UTC)
- @marcelo-gonzalez will run mocknet to test shard split of shard 2 on top of five shards