Reusable module cache

Open • graydon opened this issue 1 year ago • 12 comments

This is the stellar-core side of a soroban change to surface the module cache for reuse.

On the core side we:

  1. Add a new mechanism to scan the live BL snapshot for contracts alone (which is a small part of the BL)
  2. Add a new wrapper type SorobanModuleCache to the core-side Rust code, which holds a soroban_env_host::ModuleCache for each host protocol version we support caching modules for (there is currently only one but there will be more in the future)
  3. Also add a CoreCompilationContext type to contract.rs which carries a Budget and logs errors to the core console logging system. This is sufficient to allow operating the soroban_env_host::ModuleCache from outside the Host.
  4. Pass a SorobanModuleCache into the host function invocation path that core calls during transactions.
  5. Store a SorobanModuleCache in the LedgerManagerImpl that is long-lived and spans ledgers.
  6. Add a utility class SharedModuleCacheCompiler that does a multithreaded load-all-contracts / populate-the-module-cache pass, and call it on startup when the LedgerManagerImpl restores its LCL (see the sketch after this list).
  7. Add a couple of command-line helpers that list and forcibly compile all contracts (we might choose to remove these; they were helpful during debugging but are perhaps redundant at this point)
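
To make item 6 concrete, here is a minimal, self-contained sketch of that populate step, assuming nothing beyond the standard library: shard the contract list across worker threads and insert each compiled module into a shared, mutex-guarded map. All names, types, and the sharding scheme are illustrative stand-ins; the real SharedModuleCacheCompiler compiles wasm through the soroban host rather than this placeholder.

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Stand-ins for the real types: a compiled module and a wasm blob.
struct CompiledModule
{
    size_t codeSize;
};
using WasmBlob = std::vector<uint8_t>;

class SharedModuleCacheCompiler
{
    std::map<std::string, CompiledModule> mCache; // keyed by code hash
    std::mutex mCacheMutex;

  public:
    // Shard the contract list across nThreads workers; each worker
    // "compiles" its share and inserts results under the lock.
    void
    populate(std::vector<std::pair<std::string, WasmBlob>> const& contracts,
             size_t nThreads)
    {
        std::vector<std::thread> workers;
        for (size_t t = 0; t < nThreads; ++t)
        {
            workers.emplace_back([&, t]() {
                for (size_t i = t; i < contracts.size(); i += nThreads)
                {
                    // Placeholder for real wasm -> module compilation.
                    CompiledModule m{contracts[i].second.size()};
                    std::lock_guard<std::mutex> lock(mCacheMutex);
                    mCache.emplace(contracts[i].first, m);
                }
            });
        }
        for (auto& w : workers)
        {
            w.join();
        }
    }

    size_t
    size()
    {
        std::lock_guard<std::mutex> lock(mCacheMutex);
        return mCache.size();
    }
};

int
main()
{
    std::vector<std::pair<std::string, WasmBlob>> contracts = {
        {"hashA", {0x00, 0x61, 0x73, 0x6d}},
        {"hashB", {0x00, 0x61, 0x73, 0x6d}}};
    SharedModuleCacheCompiler compiler;
    compiler.populate(contracts, 4);
    return compiler.size() == 2 ? 0 : 1;
}
```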

The main things left to do here are:

  • [x] Maybe move all this to the soon-to-be-added p23 soroban submodule
  • [x] Decide how we're going to handle growing the cache with new uploads and wire that up
  • [x] Similarly decide how we're going to handle expiry and restoration events as discussed in CAP0062 and wire that up too
  • [x] Write some tests if I can think of any more specific than just "everything existing still works"
  • [x] Write a CAP describing this

I think that's... kinda it? The reusable module cache is just "not passed in" on p22 and "passed in" on p23, so it should just start working at the p23 boundary.

graydon avatar Jan 11 '25 01:01 graydon
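
The "not passed in on p22, passed in on p23" gating can be pictured with a toy sketch like the following; the names and exact plumbing here are hypothetical, and the real change threads the cache through the host-function invocation path rather than a free function:

```cpp
#include <cstdint>
#include <memory>

struct SorobanModuleCache
{
    // In the real code this wraps one soroban ModuleCache per
    // supported host protocol version.
};

constexpr uint32_t REUSABLE_CACHE_PROTOCOL = 23;

// Below protocol 23 no reusable cache is handed to soroban (it falls
// back to per-invocation compilation); from 23 on, the long-lived
// cache owned by the ledger manager is passed through.
std::shared_ptr<SorobanModuleCache>
cacheForProtocol(uint32_t ledgerProtocol,
                 std::shared_ptr<SorobanModuleCache> longLived)
{
    if (ledgerProtocol >= REUSABLE_CACHE_PROTOCOL)
    {
        return longLived;
    }
    return nullptr; // p22 and earlier: "not passed in"
}

int
main()
{
    auto cache = std::make_shared<SorobanModuleCache>();
    return (cacheForProtocol(22, cache) == nullptr &&
            cacheForProtocol(23, cache) == cache)
               ? 0
               : 1;
}
```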

Updated with the following:

  • Rebased
  • Adapted to simplified module cache in recently-merged upstream
  • Moved soroban changes to p23 submodule
  • Incremental expiry and addition of new contracts after ledger close
  • Less-chatty logging
  • Added support for populating module cache for side-by-side testing with SOROBAN_TEST_EXTRA_PROTOCOL=23

graydon avatar Jan 23 '25 06:01 graydon

Updated with a fix for the test failure when running with background close; all review comments should be addressed too (I think!)

graydon avatar Jan 24 '25 04:01 graydon

I think there's a bug in the ApplyBuckets work path. Currently, we use the BucketList snapshot to compile WASM. Problem is, we call compileAllContractsInLedger in AssumeStateWork. After calling app.getBucketManager().assumeState, the raw BucketList is up to date, but we don't actually update the BucketList snapshots until after AssumeStateWork is done (see LedgerManagerImpl.cpp:392). I'm pretty sure we'd currently only compile state based on the genesis ledger when we hit the apply-buckets path.

Ah! You're right.

That path is actually only there to try to make compilation happen "post catchup", even though as it stands it is also triggered on the startup AssumeState: incorrectly (as you mention), and redundantly with the explicit call to compileAllContractsInLedger at the bottom of LedgerManagerImpl::loadLastKnownLedger.

I think the fix here is to remove the call to compileAllContractsInLedger from AssumeStateWork altogether, and just do it in LedgerApplyManager when it decides the CatchupWork is complete. Then there won't be a redundant (and wrong!) compile happening during startup, and there will be a non-redundant (and hopefully right!) compile happening during catchup.

graydon avatar Jan 25 '25 01:01 graydon

I think the fix here is to remove the call to compileAllContractsInLedger from AssumeStateWork altogether, and just do it in LedgerApplyManager when it decides the CatchupWork is complete. Then there won't be a redundant (and wrong!) compile happening during startup, and there will be a non-redundant (and hopefully right!) compile happening during catchup.

Hmm I think we still have to compile before catch-up though (or at least before applying ledgers in catch-up)? I assume (perhaps incorrectly) that we have asserts that compilation is always cached and will remove the functionality to lazily compile modules during tx invocation in p23. TX replay would break these invariants.

SirTyson avatar Jan 25 '25 01:01 SirTyson

Hmm I think we still have to compile before catch-up though (or at least before applying ledgers in catch-up)? I assume (perhaps incorrectly) that we have asserts that compilation is always cached and will remove the functionality to lazily compile modules during tx invocation in p23. TX replay would break these invariants.

Hmmmm maybe! I am unclear on where we should compile during catchup, then. Perhaps in LedgerManagerImpl::setLastClosedLedger, at the end?

graydon avatar Jan 25 '25 01:01 graydon

(In terms of asserts: there is no assert in the code currently that "every contract is cached" before we call a txn; fortunately or unfortunately, soroban will not-especially-gracefully degrade to making throwaway modules as needed. We could put an assert on the core side, and I guess we should since it represents a bug-that-turns-into-a-performance-hazard)

graydon avatar Jan 25 '25 01:01 graydon
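
A sketch of the kind of core-side assert graydon is describing, with hypothetical stand-in types (the real invocation path dispatches into the soroban host): a cache miss becomes a loud failure in assert-enabled builds instead of a silent fallback to a throwaway per-invocation compile.

```cpp
#include <cassert>
#include <set>
#include <string>

// Stand-in for a view of the reusable module cache.
struct ModuleCacheView
{
    std::set<std::string> compiledHashes;
    bool
    contains(std::string const& wasmHash) const
    {
        return compiledHashes.count(wasmHash) != 0;
    }
};

void
invokeContract(ModuleCacheView const& cache, std::string const& wasmHash)
{
    // Turns the performance hazard into a visible bug when it fires.
    assert(cache.contains(wasmHash) &&
           "contract module not pre-compiled in reusable cache");
    // ... dispatch into the soroban host here ...
}

int
main()
{
    ModuleCacheView cache{{"hashA"}};
    invokeContract(cache, "hashA"); // passes the assert
    return 0;
}
```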

Hmmmm maybe! I am unclear on where we should compile during catchup, then. Perhaps in LedgerManagerImpl::setLastClosedLedger, at the end?

Yeah, that makes sense to me. Compilation is basically part of assuming the current ledger state, so seems like a reasonable place to put it.

SirTyson avatar Jan 25 '25 01:01 SirTyson

Yeah, that makes sense to me. Compilation is basically part of assuming the current ledger state, so seems like a reasonable place to put it.

Done.

graydon avatar Jan 25 '25 02:01 graydon
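
For illustration, the ordering settled on above, as a self-contained sketch with stand-in names: install the new LCL (and thus the up-to-date snapshot) first, then compile as the final step, so replayed transactions never see an unpopulated cache.

```cpp
#include <iostream>
#include <string>

struct LedgerManagerSketch
{
    std::string lclSnapshot;

    void
    compileAllContractsInLedger(std::string const& snapshot)
    {
        std::cout << "compiling contracts from snapshot: " << snapshot
                  << "\n";
    }

    void
    setLastClosedLedger(std::string const& newSnapshot)
    {
        lclSnapshot = newSnapshot;               // snapshot is now current...
        compileAllContractsInLedger(lclSnapshot); // ...so compile last
    }
};

int
main()
{
    LedgerManagerSketch lm;
    lm.setLastClosedLedger("post-catchup-lcl");
    return 0;
}
```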

Rebased around @SirTyson's latest BL changes, repaired tests, integrated bits of residual feedback, squashed and cleaned up dead/redundant state, and picked out what I think were all the build-breaking bugs.

The last one (in 1bb1b8dc2dcc7346b41db63ede1183421f4337e0) is one I suspect @SirTyson might disagree with; I do not like it either, and am very open to other options here! In general I do not understand, despite a lot of code-reading, why there are static members of LoadGenerator rather than just member variables.

The presence of the statics means that global state of the LoadGenerator carries forward from one test in the unit testsuite to another. This in turn means that, depending on how tests are partitioned by the test runner (which changes as we add tests to the testsuite), tests that rely on that state break or un-break (in particular, the REQUIRE_THROWS_WITH check in testcase "generate soroban load" of LoadGeneratorTests.cpp breaks or un-breaks depending on test order).

I think this is not ideal! Resetting the LoadGenerator statics on construction, as I'm doing in this change, seems like the least we could do. And seems (when I test it locally) to be enough to get tests passing. For now. But it also seems wrong. Like: why have static state at all? Is there no way to get rid of it?

graydon avatar Feb 08 '25 06:02 graydon
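
A minimal demonstration of the hazard, independent of the real LoadGenerator (both structs below are hypothetical): a static member survives across instances, so a second "test" observes the first one's state, while an ordinary member variable starts fresh with each construction.

```cpp
#include <cassert>

struct LoadGenWithStatic
{
    static int sTxSubmitted; // shared across every instance and test
};
int LoadGenWithStatic::sTxSubmitted = 0;

struct LoadGenWithMember
{
    int mTxSubmitted = 0; // fresh per instance, no cross-test coupling
};

int
main()
{
    {
        LoadGenWithStatic a;
        a.sTxSubmitted = 42; // "test 1" mutates shared state
    }
    LoadGenWithStatic b;
    assert(b.sTxSubmitted == 42); // "test 2" observes test 1's state

    {
        LoadGenWithMember c;
        c.mTxSubmitted = 42;
    }
    LoadGenWithMember d;
    assert(d.mTxSubmitted == 0); // member variable: clean slate
    return 0;
}
```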

In general I do not understand, despite a lot of code-reading, why there are static members of LoadGenerator rather than just member variables.

Yeah, I couldn't immediately tell either. Moreover, we already seem to reset that 'soroban state' in quite a few places. I suspect this works around some very particular scenarios that could just be fixed by preserving the loadgen instance in between loadgen calls.

dmkozh avatar Feb 10 '25 19:02 dmkozh

In general I do not understand, despite a lot of code-reading, why there are static members of LoadGenerator rather than just member variables.

Yeah, I couldn't immediately tell either. Moreover, we already seem to reset that 'soroban state' in quite a few places. I suspect this works around some very particular scenarios that could just be fixed by preserving the loadgen instance in between loadgen calls.

I talked to @SirTyson about this this morning and he thinks it was just a misunderstanding: when it was written, he was somehow thinking that LoadGenerator wasn't long-lived or something. He believes it should be fine to remove.

After that discussion, I removed the static qualifiers and at least locally on my system the tests all pass. Pushed update with that change. I guess if it works, we can just see if we rediscover some reason for those qualifiers in the future?

graydon avatar Feb 10 '25 23:02 graydon

@SirTyson I believe this is ready to go again, though... if we're holding back until post-release, it can continue to wait. But I'm not worried about its snapshot logic anymore!

graydon avatar Feb 15 '25 03:02 graydon