
Combine non-atree domain payloads into atree payloads

Open fxamacker opened this issue 1 year ago • 5 comments

Why

We can reduce memory, storage use, etc. by combining non-atree domain registers into 1 atree payload per account. We can further reduce execution state size by using the atree register inlining feature added in atree v0.8.0.

This improvement can be deployed via HCU and implements our first on-the-fly migration, avoiding further downtime.

The full impact on memory reduction (TBD) will be realized after all the accounts eventually migrate all their non-atree domain registers to a single combined atree payload per account.

NOTE: Memory reduction is only one benefit of reducing payload count.

How will we measure success?

After all the non-atree domain registers are migrated to 1 atree map per account we expect:

  • Reduce execution state size by 10-12% (~40GB, given Sept. 2024 mainnet state)
  • Reduce network bandwidth needed for chunk data packs (TBD)
  • Reduce payload count by 20-30% (by combining payloads)
  • Reduce mtrie node count by 20-30% (by eliminating mtrie nodes)
  • Reduce EN memory use by TBD (usually better than the state size reduction)

Requires HCU. If we use on-the-fly migration, then the full impact won't be realized until all accounts are migrated.

Requires "cleanup" migration for accounts that were not migrated on-the-fly.

Estimate

~3 dev-weeks to get ready for deployment.

DACI

| Role | Assigned |
| --- | --- |
| Driver | Technical: @fxamacker, EM: @j1010001 |
| Approver | @dete |
| Consulted | @ramtinms, @zhangchiqing, @turbolent |
| Informed | Flow engineering team, Product team: @vishalchangrani, Marketing: @SeanRobb |

Details

New domains were added in Cadence 1.0, and the domain register count increased to 150 million (from 80 million at the Sept. 4 pre-spork). Each 8-byte domain register also requires ~2 mtrie nodes (~2×96 bytes of overhead).

Nearly 25% of total payloads are 8-byte domain registers.

These 8-byte "domain registers" refer to 8-byte account domain payloads; they are not atree payloads.

As of Sept 10, 2024, mainnet state had:

  • total payload count: 610 million
  • atree payload count: 324,800,993
  • domain register (aka payload) count: 150,127,970
  • etc.
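As a back-of-the-envelope sanity check (not a measured value), the counts above combined with the ~2 mtrie nodes × 96 bytes per domain register quoted earlier imply roughly this much state attributable to domain registers:

```go
package main

import "fmt"

// Rough estimate using figures quoted in this issue; all constants are
// approximations from the issue text, not measured values.
const (
	domainRegisterCount = 150_127_970 // Sept 10, 2024 mainnet state
	registerValueSize   = 8           // each domain register stores 8 bytes
	mtrieNodesPerReg    = 2           // ~2 mtrie nodes per domain register
	mtrieNodeSize       = 96          // ~96 bytes per mtrie node
)

func main() {
	payloadBytes := int64(domainRegisterCount) * registerValueSize
	mtrieBytes := int64(domainRegisterCount) * mtrieNodesPerReg * mtrieNodeSize
	fmt.Printf("payload values: ~%.1f GB\n", float64(payloadBytes)/1e9) // ~1.2 GB
	fmt.Printf("mtrie overhead: ~%.1f GB\n", float64(mtrieBytes)/1e9)   // ~28.8 GB
}
```

In other words, the mtrie overhead of these tiny registers dwarfs the register values themselves, which is why eliminating the per-register mtrie nodes matters more than the 8-byte payloads.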

Suggestion

Currently, none of the 150+ million account domain registers use atree.

I think we should replace non-atree domain registers with 1 atree register (atree map) per account. Currently, there are 35 million accounts.

Also, we can use atree inlining to inline small domain data. So instead of reducing payload count from 150 million to 35 million, we can reduce payload count by 174 million (inlining also eliminates small atree payloads, which get inlined into the account map).
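To illustrate the shape of the idea (this is not the actual atree API; `combineDomains` and the length-prefixed encoding are invented for this sketch, whereas the real implementation stores an atree ordered map with inlining), combining an account's separate domain registers into a single register could look like:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"sort"
)

// combineDomains is a hypothetical sketch: it length-prefixes each
// domain entry into one value, showing how N domain registers collapse
// into 1 register per account. The real code uses an atree map instead.
func combineDomains(domains map[string][]byte) []byte {
	keys := make([]string, 0, len(domains))
	for k := range domains {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic encoding

	var out []byte
	for _, k := range keys {
		out = binary.BigEndian.AppendUint32(out, uint32(len(k)))
		out = append(out, k...)
		out = binary.BigEndian.AppendUint32(out, uint32(len(domains[k])))
		out = append(out, domains[k]...)
	}
	return out
}

func main() {
	// Before: 4 separate 8-byte domain registers, each with mtrie overhead.
	// Domain names here are illustrative.
	domains := map[string][]byte{
		"storage":  {0, 0, 0, 0, 0, 0, 0, 1},
		"public":   {0, 0, 0, 0, 0, 0, 0, 2},
		"contract": {0, 0, 0, 0, 0, 0, 0, 3},
		"inbox":    {0, 0, 0, 0, 0, 0, 0, 4},
	}
	combined := combineDomains(domains)
	// After: 1 register per account instead of 4.
	fmt.Printf("4 registers -> 1 register, %d bytes\n", len(combined))
}
```

The key point is the register count: the per-register mtrie overhead is paid once per account instead of once per domain.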

DRAFT Estimates

Based on preliminary estimates using Sept. 17, 2024 mainnet state, I think we can:

  • eliminate mtrie nodes: -425 million (-28.5%)
  • reduce payload count: -174 million (-28.5%)
  • reduce register reads (inlined domain data)
  • reduce state size by ~40GB (roughly 10-12% of state size)
  • reduce peak RAM use by TBD (usually better than state size)
  • provide single point of entry for account data

We can reduce payload count by 28% (instead of being limited to 24%) due to atree inlining.
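The arithmetic behind those two percentages, using the Sept. 10 counts above (a sanity check of the quoted figures, not a new measurement):

```go
package main

import "fmt"

// Sanity check of the percentages quoted in this issue, using the
// Sept 10, 2024 mainnet payload counts given above.
func main() {
	const totalPayloads = 610_000_000.0
	const domainRegisters = 150_127_970.0 // ceiling without inlining
	const withInlining = 174_000_000.0    // estimated reduction with inlining

	fmt.Printf("domain registers alone: %.1f%%\n", 100*domainRegisters/totalPayloads) // 24.6%
	fmt.Printf("with atree inlining:    %.1f%%\n", 100*withInlining/totalPayloads)    // 28.5%
}
```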

Misc

This issue replaces my older issue #2867.

EDIT: added DRAFT estimates and mentioned why we can reduce payload count by 174 million when there are only 150 million domain registers.

Effort Estimation

2 dev-weeks to wrap up testing and open a PR for review. Will need @turbolent to review.

fxamacker avatar Sep 13 '24 16:09 fxamacker

I was asked about impact of this, so I used mainnet state (Sept. 17 checkpoint) to create DRAFT estimates:

  • eliminate mtrie nodes: -425 million (-28.5%)
  • reduce payload count: -174 million (-28.5%)
  • reduce register reads (inlined domain data)
  • reduce state size -40GB and peak RAM by more
  • provide single point of entry for account data

We can reduce payload count by 28% (instead of being limited to 24%) due to atree inlining.

The reduction in payload count and mtrie nodes isn't as big as with atree v0.8.0 (Sept. 4, 2024), but the goals are similar.

In general, benefits from reducing payload counts and eliminating mtrie nodes include:

  • Slow down growth rate of execution state.
  • Reduce peak RAM use on execution nodes.
  • Reduce SSD storage use (e.g. checkpoint file sizes).
  • Speed up network upgrades (e.g. future migrations processing fewer payloads).
  • Future efficiency of other servers, components, databases (files/cache/index), etc. that benefit from fewer payloads.

Not every code change can eliminate a billion mtrie nodes, but smaller reductions add up over time.

fxamacker avatar Sep 26 '24 19:09 fxamacker

It seems like there are also ~140M registers (~4 per account) used by flow-go (account status, keys, etc.); maybe it would be good to consider them also. Would be nice to migrate everything to atree, as flow-go now uses atree anyway for EVM storage.

bluesign avatar Sep 29 '24 12:09 bluesign

It seems like there are also ~140M registers (~4 per account) used by flow-go (account status, keys, etc.); maybe it would be good to consider them also.

Yes, definitely worth considering! :+1:

fxamacker avatar Oct 03 '24 14:10 fxamacker

I replaced the text promising a 25% reduction in EN memory (that estimate wasn't mine).

Payload count can be reduced by 20-30% by combining payloads, so their content is not lost. In contrast, 20-30% of mtrie nodes (96 bytes each) can be safely eliminated outright.

> [!NOTE]
> If we use on-the-fly migration, then the full impact won't be realized until all accounts are eventually migrated.

Unless we do a full migration, we won't immediately see the full benefits.

After all the non-atree domain registers are migrated, we should get:

  • reduced payload count by 20-30%
  • reduced mtrie node count by 20-30%
  • reduced execution state size by roughly 40GB (given Sept. 2024 mainnet state size)
  • reduced EN memory use by TBD (usually more than state size reduction)

While at it, I listed other benefits (memory reduction is only one of them). Also, more than just ENs benefit when we reduce state size by reducing payload counts, etc.

fxamacker avatar Oct 07 '24 22:10 fxamacker

On mainnet, average number of domains per account is just over 4.

For each new account with 4 domains, I think we would save around 1,152 bytes (288 bytes per domain).

fxamacker avatar Oct 10 '24 21:10 fxamacker

To test the impact of PR #3664, I created a full migration program and ran it on mainnet data (Nov 1, 2024 state).

The full migration program, run on a test VM, produced the expected results: reducing the number of payloads and mtrie nodes by 28.7% each. 🎉

However, the plan is to deploy PR #3664 as an HCU and use its on-the-fly migration feature (no downtime), so the full impact won't be seen until all accounts (including read-only and idle accounts) are eventually migrated.

fxamacker avatar Nov 05 '24 18:11 fxamacker

Progress Update (Nov 4-15)

On Nov 4 and Nov 13, we ran full migrations with mainnet data, and test results were better than estimated. They also satisfy the "How will we measure success" criteria. All full migration test runs detected no problems 🎉, and the Nov 13 results show:

  • 11.4% reduction in state size
  • 28.8% reduction in payload count
  • 28.8% reduction in mtrie node count

On Nov 5, PR reviews started for this OKR.

On Nov 12, we merged all PRs 🎉 (to the feature branch) for the approved work: using HCU to migrate accounts having write ops.

  • We merged PR 3664 (+5026 lines) to the feature branch. This PR was opened on Oct 31, requires HCU, and migrates accounts with write ops as discussed in Sept. It includes MigrateAccount() and MigrateAccounts(). The caller can provide any criteria for migrating each account by passing a predicate function to MigrateAccounts().

  • Replacing the domain name string with an integer (Bastian's suggestion) reduced state size by an extra 1.2GB (0.3%). Old tech debt was blocking this extra optimization, so it was fixed first (23 files changed in the master branch).
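The predicate-driven migration API described above might be used roughly like this (a sketch only: the real MigrateAccounts in PR 3664 operates on ledger registers and atree storage, and the types here are simplified stand-ins):

```go
package main

import "fmt"

// Account is a simplified stand-in for illustration; the real types in
// PR 3664 differ.
type Account struct {
	Address  string
	HasWrite bool // account had a write op in this block
	Migrated bool
}

// MigrateAccounts migrates every account for which pred returns true,
// mirroring the shape described in the issue: the caller supplies the
// migration criteria as a predicate.
func MigrateAccounts(accounts []*Account, pred func(*Account) bool) int {
	migrated := 0
	for _, a := range accounts {
		if !a.Migrated && pred(a) {
			a.Migrated = true // real code rewrites registers into format v2
			migrated++
		}
	}
	return migrated
}

func main() {
	accounts := []*Account{
		{Address: "0x01", HasWrite: true},
		{Address: "0x02", HasWrite: false},
		{Address: "0x03", HasWrite: true},
	}
	// On-the-fly policy: only migrate accounts touched by write ops.
	n := MigrateAccounts(accounts, func(a *Account) bool { return a.HasWrite })
	fmt.Printf("migrated %d of %d accounts\n", n, len(accounts))
}
```

Passing the policy as a predicate is what lets the same code serve both the on-the-fly HCU path (migrate only on write) and a full "cleanup" migration (migrate everything).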

The remaining refactoring work (making the code easier to understand) was combined with the new requirements, since replacing HCU with a feature flag would also require the refactoring that was done.
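The string-to-integer domain change mentioned above can be sketched as follows (a toy sketch: the domain names and constants are illustrative, not the actual Cadence identifiers or encoding):

```go
package main

import "fmt"

// StorageDomain is an illustrative 1-byte replacement for the domain
// name strings previously stored in keys; actual Cadence constants differ.
type StorageDomain uint8

const (
	DomainStorage  StorageDomain = iota // was the string "storage"
	DomainContract                      // was "contract"
	DomainInbox                         // was "inbox"
)

func main() {
	// Each stored domain key shrinks from a multi-byte string to 1 byte,
	// which (multiplied across millions of accounts) is roughly where the
	// extra ~1.2GB of savings came from.
	names := []string{"storage", "contract", "inbox"}
	saved := 0
	for _, n := range names {
		saved += len(n) - 1 // string key replaced by a 1-byte integer key
	}
	fmt.Printf("saved %d bytes of key data for these domains\n", saved)
}
```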

New Requirements (HCU vs Feature Flag)

On Nov 5, we started discussing and exploring the possibility of using a feature flag to replace the HCU requirement.

On Nov 12, Bastian opened draft PR 3678 to explore replacing the HCU requirement with a feature flag, along with other worthwhile changes discussed since Nov 5.

By Nov 15, most of the feedback on the new PR 3678 was already incorporated:

  • #3678
  • #3680

Many thanks to the Cadence team for reviews and amazing work by Bastian 🙌 on these new PRs. The extra work and findings in this project should benefit future projects involving feature flags or on-the-fly migrations.

fxamacker avatar Nov 18 '24 18:11 fxamacker

@j1010001 I don't have any more PRs to open (unless something comes up) for OKR issue 3584 after PR 6882 gets merged.

  • https://github.com/onflow/flow-go/pull/6882

PR 6882 can be used to report the counts of v1 and v2 account formats.

You mentioned this issue (3584) has been largely done for a while (except for integration testing, etc., being done by others). So I'm focused on reviewing PRs for the team and will resume my other OKR issues.

fxamacker avatar Jan 30 '25 15:01 fxamacker

I removed "Less network bandwidth needed for chunk data packs (TBD)" because I hadn't added it to the Success Criteria list; it seems risky to add to the Success Criteria after the work was already approved.

However, we can add it as a Stretch Goal if others are convinced.

fxamacker avatar Feb 04 '25 20:02 fxamacker

I added @janezpodhostnik to the DACI Consulted (aka Contributors) list. He provided answers and valuable insights when asked about fees, etc. and also contributed in other ways. 👍

fxamacker avatar Feb 06 '25 16:02 fxamacker

UPDATE: With the Jan 31, 2025 mainnet state, this optimization reduces state size by 11.1%, eliminates 28.2% of mtrie vertexes (leaf and non-leaf nodes), and reduces payload count by 28.2%.

Our success criteria were to reduce state size by 10-12%, reduce payload count by 25-30%, and eliminate 25-30% of mtrie vertexes. 🎉

After optimizing, we also added support for handling both v1 and v2 account storage formats, which is required for zero-downtime (aka on-the-fly) migrations.

fxamacker avatar Feb 21 '25 15:02 fxamacker

Closing this as completed.

Zero-downtime (aka on-the-fly) migrations completed on mainnet. All accounts are using storage account format v2.

Storage health check, etc. didn't report any new problems on mainnet (Feb 24, 2025 snapshot):

  • https://github.com/onflow/ff-sre-infrastructure/pull/194

fxamacker avatar Feb 26 '25 03:02 fxamacker

Awesome work @fxamacker!

turbolent avatar Feb 26 '25 17:02 turbolent

Awesome work everyone! This started as an idea for a regular migration and then became more ambitious (HCU, feature flag, on-the-fly migration), yet the team managed to make everything go smoothly, and the deployment was a success.

Register and payload counts grow daily (normal data growth), so comparisons of register counts or checkpoint sizes that span weeks should adjust for this growth (to avoid accidentally shortchanging the actual results of this optimization).

fxamacker avatar Mar 06 '25 18:03 fxamacker

I'm told this reduced peak RAM use on Execution nodes by ~30% (roughly 3x the state size reduction). 🎉

fxamacker avatar Apr 25 '25 23:04 fxamacker