
Only keep track of recently used stacks in memory.

Open vicsn opened this issue 1 year ago • 2 comments

Motivation

RSS growth is correlated with deployments. By lazy-loading deployments, we hope to eliminate unbounded memory growth.

The cache will take up at most MAX_PROGRAM_DEPTH × MAX_IMPORTS × 100 KB × 10 ≈ 4 GB of data.
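As a back-of-the-envelope check of that bound, the arithmetic is spelled out below. The constant values are assumptions chosen to reproduce the stated figure, not values read out of snarkVM.

```rust
// Back-of-the-envelope check of the stated worst-case cache size.
// All values below are assumptions for illustration only.
const MAX_PROGRAM_DEPTH: u64 = 64; // assumed depth limit
const MAX_IMPORTS: u64 = 64; // assumed import limit
const STACK_SIZE_BYTES: u64 = 100 * 1024; // ~100 KB per cached Stack (estimate)
const HEADROOM: u64 = 10; // the x10 factor from the estimate above

fn main() {
    let bound = MAX_PROGRAM_DEPTH * MAX_IMPORTS * STACK_SIZE_BYTES * HEADROOM;
    // 64 * 64 * 100 KiB * 10 = 4,194,304,000 bytes ≈ 4 GB
    println!("worst-case cache size: {:.2} GB", bound as f64 / 1e9);
}
```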

Note that programs, represented as Stacks, have imports, which means the cache is essentially a DAG with multiple roots. To avoid memory leaks, we only evict root stacks from the cache. The call graph below shows where the cache is (temporarily) locked and updated. The "long" loop has some nasty edge cases; it would be better to pass all imports directly into load_deployment and Stack::new, but that would be a much bigger refactor.

[Screenshot: call graph showing where the stack cache is locked and updated]
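Below is a minimal sketch of the intended eviction policy, assuming an LRU keyed by the root program ID. The names StackCache, get_or_load, and load_deployment_uncached are illustrative only, not the actual snarkVM API; the point is that only root stacks are inserted and evicted, while imported stacks stay alive solely through the Arcs held by their roots.

```rust
use std::collections::VecDeque;
use std::sync::Arc;

/// Illustrative stand-ins for the real snarkVM types.
type ProgramID = String;

#[allow(dead_code)]
struct Stack {
    program_id: ProgramID,
    // Imported stacks are shared: two roots importing the same program hold
    // the same Arc, so the cached stacks form a DAG rather than a tree.
    imports: Vec<Arc<Stack>>,
}

/// Hypothetical LRU cache that only tracks *root* stacks.
struct StackCache {
    capacity: usize,
    // Most-recently-used root at the back, least-recently-used at the front.
    roots: VecDeque<(ProgramID, Arc<Stack>)>,
}

impl StackCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, roots: VecDeque::new() }
    }

    /// Return the cached root stack, loading (and possibly evicting) on a miss.
    fn get_or_load(&mut self, id: &ProgramID) -> Arc<Stack> {
        // Hit: move the root to the most-recently-used position.
        if let Some(pos) = self.roots.iter().position(|(pid, _)| pid == id) {
            let entry = self.roots.remove(pos).unwrap();
            let stack = entry.1.clone();
            self.roots.push_back(entry);
            return stack;
        }
        // Miss: load the deployment (which recursively loads its imports),
        // insert it as a root, and evict the least-recently-used root if full.
        let stack = load_deployment_uncached(id);
        if self.roots.len() >= self.capacity {
            // Dropping a root Arc releases its import subgraph as well,
            // unless another cached root still shares some of those imports.
            self.roots.pop_front();
        }
        self.roots.push_back((id.clone(), stack.clone()));
        stack
    }
}

/// Placeholder for the real deployment-loading path.
fn load_deployment_uncached(id: &ProgramID) -> Arc<Stack> {
    Arc::new(Stack { program_id: id.clone(), imports: Vec::new() })
}
```

Because imports are only reachable through their roots here, evicting roots alone reclaims unshared import chains without ever leaving a cached stack pointing at an evicted import.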

Test Plan

  • [ ] Unit tests pass. Requires https://github.com/ProvableHQ/snarkVM/pull/2590 to fix the test_real_example_cache_evict test.
  • [ ] Local network passes, using tx-cannon to deploy and fetch multiple sets of maximally nested deployments.
  • [ ] Deployed network passes, using tx-cannon to deploy and fetch multiple sets of maximally nested deployments.
  • [ ] Consider fixing how test storage works, which currently leaks state across tests.

Related PRs

https://github.com/ProvableHQ/snarkVM/pull/2519
https://github.com/ProvableHQ/snarkVM/pull/2553
https://github.com/ProvableHQ/snarkVM/pull/2578

vicsn avatar Jan 10 '25 16:01 vicsn

Tested this on a 5-validator network, with 4 instances doing nested deployments, 4 instances doing nested executions, and a program probe instance running every 500 ms (~100 program deployments in total). Validators reliably halted after ~70 deployments.

For testing, I added the locktick feature to snarkVM (commit c12cbf66d334159108d75f85d910d073420869b8) and snarkOS (commit 246996bb2cdbc0dc75d4029b21aad1d32db6af2d). This revealed that locks are held for a long time (>2 s) for deeply nested programs.

Example logs:

2025-02-13T09:54:21.240609Z TRACE snarkos: [locktick] checking for active lock guards
2025-02-13T09:54:21.240971Z TRACE snarkos: /Users/kp/.cargo/git/checkouts/snarkvm-438da7bfff6ff07c/c12cbf6/ledger/src/advance.rs@93:33 (Write): 304; 1 active; avg d: 2.09742095s; avg w: 58ns
2025-02-13T09:54:21.240999Z TRACE snarkos: /Users/kp/.cargo/git/checkouts/snarkvm-438da7bfff6ff07c/c12cbf6/synthesizer/src/vm/finalize.rs@591:27 (Write): 305; 1 active; avg d: 2.065852041s; avg w: 79ns
2025-02-13T09:54:21.241006Z TRACE snarkos: /Users/kp/.cargo/git/checkouts/snarkvm-438da7bfff6ff07c/c12cbf6/synthesizer/src/vm/finalize.rs@554:28 (Lock): 305; 1 active; avg d: 2.06651786s; avg w: 99ns
2025-02-13T09:54:21.241015Z TRACE snarkos: /Users/kp/.cargo/git/checkouts/snarkvm-438da7bfff6ff07c/c12cbf6/synthesizer/src/vm/mod.rs@321:27 (Lock): 305; 1 active; avg d: 2.097369753s; avg w: 97ns
2025-02-13T09:54:21.241165Z TRACE snarkos: /Users/kp/dev/stress-observability/test_suites/single-region-tests/playbooks/snarkos-shallow/246996bb2cdbc0dc75d4029b21aad1d32db6af2d/node/bft/src/bft.rs@459:38 (Lock): 2590; 1 active; avg d: 6.866325ms; avg w: 201ns

Some fix ideas:

  • use try_lock so a contended caller can fall back or retry instead of blocking (see the sketch after this list)
  • find a minimal reproduction case
  • log calls to contains_program_in_cache and get_stack so we understand where the lock "pressure" comes from
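For the first idea, here is a minimal sketch of what a try_lock fast path could look like, assuming the cache sits behind a standard RwLock. The function try_get_stack and the HashMap layout are hypothetical, not the actual cache structure.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type ProgramID = String;
struct Stack; // illustrative stand-in

/// Hypothetical read path: try the shared lock and fall back instead of
/// blocking behind a long-held write guard.
fn try_get_stack(
    cache: &RwLock<HashMap<ProgramID, Arc<Stack>>>,
    id: &ProgramID,
) -> Option<Arc<Stack>> {
    match cache.try_read() {
        // Fast path: the lock was free, answer from the cache.
        Ok(guard) => guard.get(id).cloned(),
        // Contended: something (e.g. a deeply nested deployment) holds the
        // lock. Returning None lets the caller rebuild the stack without the
        // cache, or retry later, instead of stalling for seconds.
        Err(_) => None,
    }
}
```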

kpandl avatar Feb 13 '25 14:02 kpandl

How relevant/applicable is this with the new Process + Stack changes?

I assume the cache-based optimizations here may still be useful, but there are quite a number of merge conflicts. In addition, stacks are now assumed to be upgradable due to program upgradability.
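Purely as an illustration of that concern (the Edition type and the key shape below are assumptions, not the actual design): with upgradable programs, a cached stack would have to be keyed or invalidated by the program's edition, otherwise an upgraded program could keep serving its stale pre-upgrade stack.

```rust
use std::collections::HashMap;
use std::sync::Arc;

type ProgramID = String;
type Edition = u16; // assumed: a counter bumped on each program upgrade
struct Stack; // illustrative stand-in

/// Hypothetical upgrade-aware cache: keying by (program, edition) makes a
/// lookup for a newly upgraded program miss and force a reload, while the
/// stale entry ages out through the normal eviction policy.
struct UpgradableStackCache {
    entries: HashMap<(ProgramID, Edition), Arc<Stack>>,
}

impl UpgradableStackCache {
    /// Returns the cached stack only if it matches the program's current edition.
    fn get(&self, id: &ProgramID, current_edition: Edition) -> Option<Arc<Stack>> {
        self.entries.get(&(id.clone(), current_edition)).cloned()
    }
}
```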

raychu86 avatar Jul 29 '25 20:07 raychu86