sui [move] Implement and enable on-chain package hooks

Description

This PR implements and enables on-chain package hooks which makes it possible to build against on-chain dependencies specified in the manifest.

The hooks are implemented as follows:

resolve_on_chain_dependency:
- fetches the dependency from the chain based on the provided package ID
- creates a bytecode package in ~/.move/<id> by generating a manifest and saving the modules
custom_resolve_pkg_id:
- checks for the lockfile and tries to resolve the original-published-id for the current env
- if the lockfile doesn't exist or doesn't have the relevant env entry, we try to resolve by looking up the published-at field in the manifest -- the package is fetched and the id is resolved as the self id of any module in that package (this is also how the on-chain packages generated with resolve_on_chain_dependency are resolved, with caching to avoid duplicate fetching)
- if there's no published-id field either, we assume the package hasn't been published and resolve the id as its manifest name
custom_resolve_version:
- same as above but we resolve the version from the version field in env (in case of lockfile) or the version value from the fetched package (in case of published-at)
- if we can't resolve with above we resolve version as None (see this for more details https://github.com/MystenLabs/sui/blob/main/external-crates/move/crates/move-package/src/resolution/dependency_graph.rs#L135-L152)

Since packages are now resolved by their (original-id, version) pair instead of manifest name, this effectively makes it possible to consolidate both source and on-chain dependencies under a single dependency graph. The reasoning and principles behind this design are outlined here https://gist.github.com/kklas/d9edad7875548fae57b0c86bc8871fea

cc @rvantonder @amnn https://github.com/MystenLabs/sui/pull/14178

Notes and known limitations

The async stuff in SuiPackageHooks is a bit hazardous. The gist of it is that the hook methods are required to be non-async but we need to use the SuiClient within them both for fetching the on-chain packages and resolving package version an ids based on published-at. The way I got around it is that I run the future in tokio::task::block_in_place which will inform the existing executor (in case we're already running within tokio) about the blocking task so it hands of other any tasks to another thread. Then within it we dispatch the future in a separate runtime that we control. This works and it will not starve other tasks but it can lead to weird behaviour, slowness, or even deadlocks in some cases when multiple pieces of code are running concurrently within the same task (which is facilitated by macros like join! and select!). I took a quick look at the codebase and it seems to me it won't be an issue because the sui CLI doesn't use that kind of concurrency for commands, the sui-move-lsp doesn't use tokio / async, and sui-source-validation-service runs all build tasks using tokio::spawn which will be OK. Nonetheless I think the hooks should be refactored to be async ASAP (ideally make them not be global static also)!

Since the packages are now identified by their (original-id, version), the dependency graph section in the lockfiles is now env specific. This means that if the lockfile was generated for the testnet env, having that package as a dependency and building on the mainnet will lead to inconsistencies as testnet ids will be loaded into the graph. Perhaps there should be an optional chain_id field in the lockfile so that it can be ignored when there's a mismatch?

When using on-chain dependencies it's not possible to specify which env they correspond to, so these manifests will only work for a single env.

move-analyzer (to my surprise) works with bytecode deps (based on a smaller project I tested it on, it could have issues elsewhere), but it's not possible to change the env after it starts. Ideally it would pick up changes from WalletContext or have a setting for selecting the env in the editor menu.

Test plan

Added an integration test, tested manually on large-ish projects.

Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

[ ] Protocol:
[ ] Nodes (Validators and Full nodes):
[ ] Indexer:
[ ] JSON-RPC:
[ ] GraphQL:
[x] CLI: It's now possible to build against on-chain dependencies using an on-chain dependency entry in Move.toml (e.g. Foo = { id = "0x123" }).
[ ] Rust SDK:
[ ] REST API:

Sep 30 '24 08:09 kklas

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
sui-docs	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Nov 1, 2024 9:28pm

3 Skipped Deployments

Name	Status	Preview	Updated (UTC)
multisig-toolkit	⬜️ Ignored (Inspect)	Visit Preview	Nov 1, 2024 9:28pm
sui-kiosk	⬜️ Ignored (Inspect)	Visit Preview	Nov 1, 2024 9:28pm
sui-typescript-docs	⬜️ Ignored (Inspect)	Visit Preview	Nov 1, 2024 9:28pm

Sep 30 '24 08:09 vercel[bot]

@amnn I have rebased and fixed clippy, test-extra, and cargo-deny (advisory, fixed by rebase). I can't reproduce TS e2e tests and split cluster check (all pass on my machine) and cargo test (can't reproduce -- consistently have other failures locally even on trunk).

Oct 10 '24 20:10 kklas

Thanks @kklas, some of these tests may be flaky -- I've triggered another run, and we can try retrying them to see.

Oct 11 '24 11:10 amnn

Based on that, outside of dependency resolution, we don't have to worry about this async API complication.

@amnn my understanding is this: I agree we shouldn't have to worry about it, and we don't if the other code that depends on the hooks is not aysnc / concurrent. But if it is then it's an issue because what you're trying to do is run blocking code (because all package hooks functions are non async) that then encapsulates some async code that's IO bound and doing http requests basically. The canonical way to do that (run async code from non-async function) is to run self.handle.block_on (where the handle is another tokio runtime we created for package hooks specifically), but the problem is that tokio doesn't allow running runtimes within runtimes (in case the main is using tokio already) so you will get this error:

thread 'tokio-runtime-worker' panicked at 'Cannot start a runtime from within a runtime. This happens because a function (like `block_on`) attempted to block the current thread while the thread is being used to drive asynchronous tasks.

So this is the conundrum and I don't think using channels does anything fundamentally to alleviate that but would rather create more issues -- e.g. the "build dependency graph" call is sync, which will then call into a package hook which will send something to a channel and block waiting for the response. This can cause a deadlock because the hook task might be running on the same thread as the one which is handling the requests from the hooks (and if you want to use tokio channels instead of std then we're back at square one because you can't await in the hook). Seems like an anti-pettern to use channels for this and more complex in general.

After thinking more about this, I can come up with another solution that works:

    pub fn run_async<F>(&self, future: F) -> F::Output
    where
        F: std::future::Future + Send + 'static,
        F::Output: Send + 'static,
    {
        let h = self.handle.clone();
        std::thread::spawn(move || h.block_on(future))
            .join()
            .expect("Thread panicked")
    }

So instead of trying to use the existing runtime with block_in_place which will spawn the task in a spearate "blocking" thread (and which has caveats I outlined in the OP and here), we spawn an isolated thread and do the async stuff in there with a separate runtime. So the advantages of this are:

doesn't rely on Handle::try_current() to detect whether we're running within tokio which is an anti-pattern
doesn't use block_in_place which can cause unexpected behaviour, especially in join! or select! scenarios
spawning a separate thread completely isolates the other runtime

The downside though is the cost of spawning new threads but I guess we can consider this similar to the overhead of running an external resolver command which is a separate process.

Ultimately the issue IMO is the design of the dependency graph. If we want to be able to resolve packages asynchronously, which we do basically because all the sui client that we need are async, the dependency graph needs to support that natively. So with all of this separate thread creation etc. we're just dancing around this issue. Not sure behind the design decision to not have async in the dependency graph but maybe it's time to revisit that. Another issue is PackageHooks are global once registered which also creates issues in concurrent scenarios if you want to run it for different settings concurrently.

Code organisation-wise, we will want to move all this hook stuff out into its own crate

Makes sense! Will do that!

Oct 14 '24 14:10 kklas

So this is the conundrum and I don't think using channels does anything fundamentally to alleviate that but would rather create more issues -- e.g. the "build dependency graph" call is sync, which will then call into a package hook which will send something to a channel and block waiting for the response.

When I originally mentioned using channels, what I had in mind was the following, which I think gets around your concern around the task being potentially scheduled on the same thread as the caller, and your other concern about creating lots of short-lived threads:

When you register the package hook, it also spawns its own dedicated tokio runtime, and a task in that runtime to handle package hook requests.
Inside that runtime, it uses non-blocking tokio channels to receive requests.
Outside the runtime (i.e. in the actual package hooks), blocking calls are made on the other sides of those channels, to make whatever sync code that relies on the package hook wait for the hook to do its work.

You can simulate a function call with channels using a pattern like this:

async fn foo(x: Param) -> Ret { 
    /* ... */
    bar
}

Becomes (roughly -- I haven't run this):


static ACTOR: OnceCell<Actor> = OnceCell::new();

struct Actor {
    call: mpsc::Sender<(Param, oneshot::Receiver<Ret>)>,
    rt: Runtime,
}

fn foo(x: Param) -> Result<Ret> {
    let call = CALL.get().clone();
    let (tx, rx) = oneshot::channel();
    blocking_send((x, tx))?;
    rx.blocking_recv()
}

fn register_hook() {
    let rt = Builder::new_multi_thread()
        .enable_all()
        .build()
        .expect("Couldn't set-up runtime for actor");

    // Support some moderate level of concurrency (buffer for four 
    // requests).
    let (tx, rx) = mpsc::channel(4);

    // NB. This will complain if you register the hook twice, so test 
    // scenarios may be complicated.
    ACTOR.set(Actor { call: tx, rt }).expect("Failed to set-up actor");

    rt.spawn(async move {
        loop {
            let (x, tx) = rx.recv().await.expect("tx can't close");
            /* ... */

            // Ignore the error from this -- it means the rx side was dropped,
            // but that is not this actor's concern.
            let _ = tx.send(bar).await;
        }
    });
}

Tokio has some resources on this problem: https://tokio.rs/tokio/topics/bridging -- this approach is the one suggested here:

Sending messages

The third technique is to spawn a runtime and use message passing to communicate with it. This involves a bit more boilerplate than the other two approaches, but it is the most flexible approach. [...] This example could be configured in many ways. For instance, you could use a Semaphore to limit the number of active tasks, or you could use a channel in the opposite direction to send a response to the spawner. When you spawn a runtime in this way, it is a type of actor.

But to your point here:

Ultimately the issue IMO is the design of the dependency graph. If we want to be able to resolve packages asynchronously, which we do basically because all the sui client that we need are async, the dependency graph needs to support that natively. So with all of this separate thread creation etc. we're just dancing around this issue. Not sure behind the design decision to not have async in the dependency graph but maybe it's time to revisit that. Another issue is PackageHooks are global once registered which also creates issues in concurrent scenarios if you want to run it for different settings concurrently.

Yes, we do need to redesign dependency resolution with all its new constraints in mind. Package Hooks predate Sui, which uses quite a different philosophy to packages compared to what came before it. Once MVR and this work have landed, I'm hoping we can pick up that redesign work internally.

Oct 16 '24 11:10 amnn

@amnn and I continued the async discussion over another channel and agreed that the solution I implemented here is a reasonable one considering the trade-offs.

Nov 01 '24 23:11 kklas

This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.

Jan 01 '25 02:01 github-actions[bot]