Reusable module cache
This is the stellar-core side of a soroban change to surface the module cache for reuse.
On the core side we:
- Add a new mechanism to scan the live BL snapshot for contracts alone (which is a small part of the BL)
- Add a new wrapper type `SorobanModuleCache` to the core-side Rust code, which holds a `soroban_env_host::ModuleCache` for each host protocol version we support caching modules for (there is currently only one but there will be more in the future)
- Also add a `CoreCompilationContext` type to `contract.rs` which carries a `Budget` and logs errors to the core console logging system. This is sufficient to allow operating the `soroban_env_host::ModuleCache` from outside the `Host`.
- Pass a `SorobanModuleCache` into the host function invocation path that core calls during transactions.
- Store a `SorobanModuleCache` in the `LedgerManagerImpl` that is long-lived and spans ledgers.
- Add a utility class `SharedModuleCacheCompiler` that does a multithreaded load-all-contracts / populate-the-module-cache pass, and call this on startup when the `LedgerManagerImpl` restores its LCL (see the sketch after this list).
- Add a couple of command-line helpers that list and forcibly compile all contracts (we might choose to remove these; they were helpful during debugging but at this point are perhaps redundant)
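To make the pieces above concrete, here is a minimal C++ sketch of the intended ownership and startup flow; every class shape, member, and signature below is an illustrative stand-in, not the actual stellar-core or Rust-bridge declaration.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Stand-in only: the real SorobanModuleCache is a Rust-side wrapper holding
// one soroban_env_host::ModuleCache per cache-enabled host protocol version.
struct SorobanModuleCache
{
};

using WasmBlob = std::vector<uint8_t>;

// Illustrative shape of the utility class: compile every contract found in the
// live BucketList snapshot on several worker threads and add the resulting
// modules to `cache`. The real signature in the PR may differ.
struct SharedModuleCacheCompiler
{
    void compileAll(SorobanModuleCache& cache,
                    std::vector<WasmBlob> const& contracts, size_t nThreads);
};

class LedgerManagerImpl
{
    // Long-lived: owned here so it survives across ledger closes instead of
    // being rebuilt per transaction or per ledger.
    std::shared_ptr<SorobanModuleCache> mModuleCache{
        std::make_shared<SorobanModuleCache>()};

  public:
    void
    loadLastKnownLedger()
    {
        // ... restore the LCL ...
        // Then scan the snapshot for contract-code entries only (a small part
        // of the BucketList) and warm the cache before applying ledgers.
    }

    // The cache handed to the host-function invocation path during transactions.
    std::shared_ptr<SorobanModuleCache>
    getModuleCache() const
    {
        return mModuleCache;
    }
};
```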
The main things left to do here are:
- [x] Maybe move all this to the soon-to-be-added p23 soroban submodule
- [x] Decide how we're going to handle growing the cache with new uploads and wire that up
- [x] Similarly decide how we're going to handle expiry and restoration events as discussed in CAP0062 and wire that up too
- [x] Write some tests if I can think of any more specific than just "everything existing still works"
- [x] Write a CAP describing this
I think that's .. kinda it? The reusable module cache is just "not passed in" on p22 and "passed in" on p23, so it should just start working at the p23 boundary.
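A minimal sketch of that protocol boundary, assuming a hypothetical `invokeHostFunction` entry point where a null cache means "compile on demand" (the real invocation path and its parameters live in core's transaction code and may look different):

```cpp
#include <cstdint>
#include <memory>

// Stand-in for the Rust-side wrapper type.
struct SorobanModuleCache
{
};

// Hypothetical invocation entry point; a null cache means "compile on demand".
void
invokeHostFunction(std::shared_ptr<SorobanModuleCache> /*reusableCache*/)
{
    // ... call into the Soroban host via the Rust bridge ...
}

void
invokeForLedger(uint32_t ledgerProtocol,
                std::shared_ptr<SorobanModuleCache> cache)
{
    if (ledgerProtocol >= 23)
    {
        // p23 and later: hand the long-lived cache to the host so previously
        // compiled modules are reused across transactions and ledgers.
        invokeHostFunction(cache);
    }
    else
    {
        // p22 and earlier: behave exactly as before, no cache passed in.
        invokeHostFunction(nullptr);
    }
}
```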
Updated with the following:
- Rebased
- Adapted to simplified module cache in recently-merged upstream
- Moved soroban changes to p23 submodule
- Incremental expiry and addition of new contracts after ledger close
- Less-chatty logging
- Added support for populating the module cache for side-by-side testing with `SOROBAN_TEST_EXTRA_PROTOCOL=23`
Updated with fix for test failure when running with background close, as well as all review comments addressed (I think!)
I think there's a bug in the `ApplyBuckets` work path. Currently, we use the BucketList snapshot to compile WASM. Problem is, we call `compileAllContractsInLedger` in `AssumeStateWork`. After calling `app.getBucketManager().assumeState`, the raw BucketList is up to date, but we don't actually update the BucketList snapshots until after `AssumeStateWork` is done (see `LedgerManagerImpl.cpp:392`). I'm pretty sure we'd currently only compile state based on the genesis ledger when we hit the apply-buckets path.
Ah! You're right.
That path is actually only there to try to make compilation happen "post catchup", even though as it stands it is also triggered on the startup `AssumeState`: incorrectly (as you mention), and redundantly with the explicit call to `compileAllContractsInLedger` below, at the bottom of `LedgerManagerImpl::loadLastKnownLedger`.
I think the fix here is to remove the call to `compileAllContractsInLedger` from `AssumeStateWork` altogether, and just do it in `LedgerApplyManager` when it decides the `CatchupWork` is complete. Then there won't be a redundant (and wrong!) compile happening during startup, and there will be a non-redundant (and hopefully right!) compile happening during catchup.
Hmm I think we still have to compile before catch-up though (or at least before applying ledgers in catch-up)? I assume (perhaps incorrectly) that we have asserts that compilation is always cached and will remove the functionality to lazily compile modules during tx invocation in p23. TX replay would break these invariants.
Hmmmm maybe! I am unclear on where we should compile during catchup, then. Perhaps in LedgerManagerImpl::setLastClosedLedger, at the end?
(In terms of asserts: there is no assert in the code currently that "every contract is cached" before we call a txn; fortunately or unfortunately, soroban will not-especially-gracefully degrade to making throwaway modules as needed. We could put an assert on the core side, and I guess we should since it represents a bug-that-turns-into-a-performance-hazard)
Yeah, that makes sense to me. Compilation is basically part of assuming the current ledger state, so seems like a reasonable place to put it.
Done.
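For concreteness, a sketch of where the thread above lands, with compilation moved to the end of `setLastClosedLedger`; the surrounding class and the body of `compileAllContractsInLedger` are stand-ins rather than the real code:

```cpp
#include <cstdint>

// Stand-ins only; the real types and the rest of setLastClosedLedger live in
// stellar-core's LedgerManagerImpl.
struct LedgerHeaderHistoryEntry
{
    uint32_t ledgerSeq = 0;
};

class LedgerManagerImpl
{
  public:
    void
    setLastClosedLedger(LedgerHeaderHistoryEntry const& lastClosed)
    {
        // ... existing logic: adopt `lastClosed` as the LCL and publish the
        // updated BucketList snapshots ...

        // Compile only once the snapshots reflect the new LCL, so both the
        // startup path and the catchup path warm the cache here rather than in
        // AssumeStateWork (where the snapshots were still stale).
        compileAllContractsInLedger();
    }

  private:
    void
    compileAllContractsInLedger()
    {
        // Scan the live snapshot for contract code and populate the module cache.
    }
};
```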
Rebased around @SirTyson's latest BL changes, repaired tests, integrated bits of residual feedback, squashed and cleaned up dead/redundant states, and picked out what I think were all the build-breaking bugs.
The last one (in 1bb1b8dc2dcc7346b41db63ede1183421f4337e0) is one I suspect @SirTyson might disagree with; I do not like it either, and am very open to other options here! In general I do not understand, despite a lot of code-reading, why there are static members of LoadGenerator rather than just member variables.
The presence of the statics means that global state of the LoadGenerator carries forward from one test in the unit testsuite to another. This in turn means that depending on how tests are partitioned by the test runner (which changes as we add tests to the testsuite) tests that rely on that state break or un-break (in particular the REQUIRE_THROWS_WITH check in testcase "generate soroban load" of LoadGeneratorTests.cpp breaks or un-breaks depending on test order).
I think this is not ideal! Resetting the LoadGenerator statics on construction, as I'm doing in this change, seems like the least we could do. And seems (when I test it locally) to be enough to get tests passing. For now. But it also seems wrong. Like: why have static state at all? Is there no way to get rid of it?
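As an illustration of why the statics bite here (the member names below are made up, not the actual `LoadGenerator` fields): a static member is shared by every instance, so whatever one test case writes is still visible when a later test constructs a supposedly fresh generator, while a plain member variable starts clean on every construction.

```cpp
#include <cstdint>

// Illustrative only -- not the real LoadGenerator.
class LoadGeneratorWithStatic
{
  public:
    // Shared across every instance, so state written by one test case is still
    // there when the next test builds its own Application and generator.
    static uint64_t sSorobanStateSize;
};
uint64_t LoadGeneratorWithStatic::sSorobanStateSize = 0;

class LoadGeneratorWithMember
{
  public:
    // Re-initialized whenever a new generator is constructed, so each test
    // starts from a clean slate without any explicit reset.
    uint64_t mSorobanStateSize = 0;
};
```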
> In general I do not understand, despite a lot of code-reading, why there are static members of LoadGenerator rather than just member variables.
Yeah, I couldn't immediately tell either. Moreover, we seem to reset that 'soroban state' in quite a few places already. I suspect this works around some very particular scenarios that could just be fixed by preserving the loadgen instance in between loadgen calls.
I talked to @SirTyson about this this morning and he thinks it was just a misunderstanding -- like when it was written he was somehow thinking that LoadGenerator wasn't long-lived or something. He believes it should be fine to remove.
After that discussion, I removed the static qualifiers and at least locally on my system the tests all pass. Pushed update with that change. I guess if it works, we can just see if we rediscover some reason for those qualifiers in the future?
@SirTyson I believe this is ready to go again, though .. if we're holding back until post-release, it can continue to wait. But I'm not worried about its snapshot logic anymore!