sui add profile-transaction to CLI

Description

Adds a command to the CLI for profiling a transaction. The command will generate the profile, and the user will then be able to open it using a flamegraph viewer of choice. Considering that the user may have a different preference of flamegraph tool already installed, and that using speedscope is not specifically required, I would prefer to leave speedscope (and node on which it relies) out of the "prerequisites for sui", especially since it is not a true prerequisite. Another thing I considered here is that a common use pattern will be to make iterative changes and compare profiles, so a user will want to run profile-transaction [digest] -> code changes -> profile-transaction [digest] -> open both files at the same time and compare. Thus leaving these two steps of generating and viewing the profile separate would be helpful for those cases.

I have added a configuration value to enable the profiler to the existing VMProfilerConfig. For now, the profiler can still additionally be enabled via env var method to allow for easy enablement when running tests.

./target/debug/sui client profile-transaction --tx-digest AbfBmKgsx4SaMY3XuNgqZ1DAxD5riLvos6B4hhaMb8y8 will output a profile to the current working directory whereas there is also an option to specify a filepath for the profile output such as

./target/debug/sui client profile-transaction --tx-digest AbfBmKgsx4SaMY3XuNgqZ1DAxD5riLvos6B4hhaMb8y8 -- /Users/laura/temp/my_profile.json

If your changes are not user-facing and not a breaking change, you can skip the following section. Otherwise, please indicate what changed, and then add to the Release Notes section as highlighted during the release process.

Type of Change (Check all that apply)

[ ] protocol change
[ ] user-visible impact
[ ] breaking change for a client SDKs
[ ] breaking change for FNs (FN binary must upgrade)
[ ] breaking change for validators or node operators (must upgrade binaries)
[ ] breaking change for on-chain data layout
[ ] necessitate either a data wipe or data migration

Release notes

Nov 01 '23 08:11 lavindir

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
mysten-ui	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Jan 24, 2024 6:09am
sui-core	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Jan 24, 2024 6:09am
sui-typescript-docs	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Jan 24, 2024 6:09am

3 Ignored Deployments

Name	Status	Preview	Updated (UTC)
explorer	⬜️ Ignored (Inspect)	Visit Preview	Jan 24, 2024 6:09am
multisig-toolkit	⬜️ Ignored (Inspect)	Visit Preview	Jan 24, 2024 6:09am
sui-kiosk	⬜️ Ignored (Inspect)	Visit Preview	Jan 24, 2024 6:09am

Nov 01 '23 08:11 vercel[bot]

ready for re-review

Nov 20 '23 07:11 lavindir

Also, I second @tzakian's question on whether it's feasible to add something to ensure sui-node is never built with the profiler enabled.

Yes, I added a warning if it the feature is enabled when running the validator node binary, but we can make a strong guarantee by having it panic instead

Dec 04 '23 06:12 lavindir

The main question I have is for the Cargo.toml files I see a number of places where the fatures are both manually set for the crate being imported and also enabled based on a feature in the crate. My understanding of Cargo features is you should just need to enable them at the bottom (and otherwise they're always turned on I think? But I need to read up on this more).

The declaration at the bottom creates a new feature for the current crate, which is required for every crate that we use #[cfg(feature = "gas-profiler")]. I am happy to go into the configuration in more details, because I know it is really confusing

Dec 13 '23 08:12 lavindir

@amnn The best way to convince yourself that it works is to try it None of this is well documented, and took weeks of trial and error cargo install --locked --git https://github.com/MystenLabs/sui.git --branch laura/add_gas_profiler_to_cli --features gas-profiler sui then sui client profile-transaction --tx-digest AbfBmKgsx4SaMY3XuNgqZ1DAxD5riLvos6B4hhaMb8y8

Dec 14 '23 03:12 lavindir

TL;DR: The feature gate on the CLI is working, but profiling during testing is broken, the error! in sui-node's main is currently silent, and the profiler code is being enabled still in the Move crates because of hakari.

So I pulled down the PR and tried some things. Here's what I found:

After the most recent push, the feature gating is working better for the sui CLI. Prior to it, it was possible to run the gas profiler, even without the feature enabled. This is related to removing the feature tags on the dependencies (so it's driven purely by one feature enabling another. (Note that there are still some straggler cases where dependencies have the feature explicitly enabled).

Profiling a test run is not working (example run shared below). This is related to the issue I mentioned in an earlier comment (the default file path being empty does not work, see the error in the output from the run):

sui$ cargo build --bin sui --features gas-profiler
sui$ cd crates/sui-framework/packages/sui-framework
sui-framework$ env MOVE_VM_PROFILE=1 ~/sui/target/debug/sui move test
... [snip] ...
Unable to create file: Os { code: 2, kind: NotFound, message: "No such file or directory" } panic.file="external-crates/move/crates/move-vm-profiler/src/lib.rs" panic.line=187 panic.column=49

The issue that caused you to make profiler: Option<GasProfiler> not a gated part of sui-types is a symptom of a larger issue, which is that ~hakari hates you, and wishes you ill~ the move crates in workspace-hack still have the gas-profiler feature enabled always, which means that even though you are carefully constructing this tree of gas-profiler feature flags, hakari is merging all potential uses of them together, which is undoing your hard work.

I believe this is happening because they are considered to be external crates, and so hakari is merging all their usages. Additionally, hakari is too clever, and it has spotted that when certain crates within our workspace are compiled with all features enabled, we enable the gas-profiler feature in move crates, so that's why it's being enabled.

The fix would seem to be to incorporate the move crates into our workspace, which was something that @bmwill was looking into. Then our tree of feature flags should do the right thing (I'm inferring this from the fact that sui-types is not included in workspace-hack, and so doesn't have this problem). There may also be something you could do with the hakari config (sui/.config/hakari.toml)? But that's speculation on my part and I'm less certain about that than the "include in the workspace" strategy. I only suggest it because it might be a quicker way to get unblocked for now than to wait for the workspace inclusion work to continue.

Because of the above, I was surprised that sui-node seemed to be running fine despite the check you had added (if that hypothesis was correct, then we should be triggering that error). This was because of two issues:

error! is a tracing message, which means you need to initialise telemetry before we start to see log messages. This is done later on in main than your error message.
In any case, we probably want something more aggressive than an error message, which can be missed. I tried changing the error! into a panic! (see below, I needed to wrap it in if true { ... } because otherwise the compiler complains that the rest of the function is unreachable), and indeed, this code path gets hit when I run:

sui$ cargo run --bin sui-node -- --help

    move_vm_profiler::gas_profiler_feature! {
        if true {
            panic!("Cannot run the sui-node binary with gas-profiler feature enabled");
        }
    }

So the macro approach to checking whether the feature exists is working well, but we need to strengthen the test.

Dec 14 '23 18:12 amnn

I have made the fixes and added a test. @amnn thank you so much for those finds! There was one un-resolved comment thread about the fact that I removed the feature gating from the attributes that have implementations in other crates, had given it another go to add the feature gating there in commit add back feature gating to profiler as struct attribute but then I ran into similar build errors like this https://github.com/MystenLabs/sui/actions/runs/7246236235/job/19737768634, which leads me to believe we should avoid that kind of feature gating that relies on features in other crates being enabled. Also tried adding a second function implementation under #[cfg(not(feature = "gas-profiler"))] but the trait implementation was expecting just one function definition.

cargo tree -p sui-node -f "{p} {f}" now shows that there is no gas-profiler feature anywhere, and if I run the sui-node binary locally it doesn't panic.

Dec 20 '23 13:12 lavindir

The removal of the gating of the trait function members get_profiler and set_profiler will not cause any performance regression because all calls to these two functions remain gated. The profiler struct member above is only accessed via these two functions.

This comment made me think about the fact that in the debug-build gated present state, we were previously not gating any of the macro calls into the profiler from the interpreter. I have gone ahead and gated those as well, which should cause a performance improvement since those calls are the ones that are getting looped-over several times.

Jan 04 '24 01:01 lavindir

I haven't been apart of the discussions on profiling but I'm very hesitant about the path that is being taken here wrt using cargo feature flags. Feature flags tend to work really well for single, stand alone libraries but tend to be much harder to work with when you have a feature that spans numerous crates in a workspace (and in this case across workspaces). This is made worse by the fact that this particular feature is not designed to work with best practices around cargo features, in that features must be additive and that enabling a feature in one crate should not break compilation, or functionality, in another.

I agree with you that this was very hard to get working. The feature is additive in that it will not break compilation or functionality if it is partially enabled, and I have done extensive testing around this to verify both that the feature adheres to the additive best practice, and that if something in the feature tree were to break, the worst case scenario is that the profiler would stop working.

Another large concern of mine is testability. Since compiled code can't be reused, we can't do a single build in CI to test any functionality gated by this feature. This code would likely end up being left untested. We already are doing two full compilations and test runs due to MSIM, if we wanted to test this functionality what would be the plan?

I previously had the sui crate dev dependency enable the feature for testing, which would allow us to run a connectivity test for the feature tree. However, given that we are only doing a single build and want to stick with that, we can set up an actions script that tests builds with the feature enabled and runs this test once every release cycle to make sure we don't release a version of the CLI that has a broken profiler. I would rather maintain that the CI to build the version without the feature enabled. Thoughts on this?

My assumption is that the reason it was decided to use cargo features for this is because we're worried about performance. Have we run any performance evaluations with this logic compiled in and disabled, compiled in and enabled, and compiled out to see what the effect is on performance? In other words is there an material impact on performance with this compiled in but disabled that warrants us going down this path of relying on cargo features for conditional compilation? There's a number of things that can be done to reduce the impact on performance that would likely be easier to maintain than using conditional compilation.

Given the amount of importance we place on performance, having extra data recorded on every move call is not something in my mind that we would choose to accept. I don't have concrete numbers but it would cost O(n) where n is the number of function calls in the transaction. If we look at a few profiles, we can see that this is not a trivial cost.

Jan 12 '24 00:01 lavindir

Thanks for taking a look at this PR @bmwill, your concerns are definitely not unfounded, and we also didn't come to the decision of using features lightly, because we knew this would be the source of some current and ongoing pain.

From my perspective, there are two main reasons why I'm keen to have as little (or none) of the GasProfiler in sui-node:

One is performance, but rather than some desire to maintain some known performance metric, it comes more from a place where we know that the performance of the interpreter is not as good as it could be now, and I don't want that to become a slippery slope towards it being bad forever. I'm sure in this case we could get the cost of the profiler down to a single well-predicted branch for every Move function call -- it's unlikely we would notice that, but that's largely because a call is already quite expensive today, which is something we want to change. The addition of a branch instruction to calls is something that you would typically notice in any other interpreter.
The other is preserving the TCB. Here again, I know that we don't have a thorough firewall around what is and isn't in the TCB, but I'm particularly nervous about introducing something, non-trivial, that was previously only enabled in debug builds, and is related to gas, to the authority codebase. In general, we have a different bar for local tooling vs what goes in the authority codebase.

Additionally, for me, gas-profiling is just the tip of the spear in terms of tools that might require instrumenting the VM for local tooling in a way that we don't want to preserve in the authority codebase (debuggers, tracers, memory profilers, REPLs are all other examples).

However, given that we are only doing a single build and want to stick with that, we can set up an actions script that tests builds with the feature enabled and runs this test once every release cycle to make sure we don't release a version of the CLI that has a broken profiler. I would rather maintain that the CI to build the version without the feature enabled. Thoughts on this?

I like this approach, mainly because it means the "default" configuration that gets tested the most is the one that preserves the TCB.

Re: Buck2 (context -- I used to work on Buck and the migration from Buck v1 to Buck v2 for mobile builds!). I suspect that what we are trying to do here would work much more smoothly in Buck. I would model it as something akin to a build flavour (release, debug, test), which can be toggled using configurations:

We would define a custom build rule, say sui_crate that listens to a configuration that controls whether the target is being built for the TCB or not.
Each crate is responsible for controlling its own build set-up based on these configurations, so the gas profiler could choose whether it was enabled or not based on this tcb configuration -- no more cascade of feature flags.
When we build sui-node we would build under the release and tcb configuration.
When we build sui we would build under the release configuration, but not tcb.
Buck should be able to correctly identify which targets builds change in response to these configs, to maximise re-use of the cache.
We can choose whether we test with tcb enabled or not, or both, given the effects to caching/incrementality should then be much more manageable.

Jan 12 '24 14:01 amnn

I have removed the feature addition to the CLI build as we agreed above, and started a thread in the PE channel for setup of a different task to run the profiler test

Jan 17 '24 06:01 lavindir

fwiw, @amnn is correct re buck2 - we can make it work. I think select() is the magic that makes it work, but tbh, i'm not sure. I have only used select for platform flavors and not this specific use case.

Jan 17 '24 20:01 suiwombat

Agreed! Although I'm wondering if there's any loss to keeping these feature gated? I'd prefer to keep them feature-gated just from a clarity perspective if it's not too hard. But that being said, if it adds difficulty totally OK to keep them ungated.

After doing some investigation on the limitations and recommended usages of features, I have come to the conclusion that these functions cannot be gated. This is because they are part of a trait, which is defined in a different crate than the structs that implement them. If we gate this, during some part of the build process we receive a build error, with either "this function is not part of the trait" or "trait is missing implementation of these functions". The correct usage for features specifies that there should be both no build or logical errors if any single feature is enabled for a crate, regardless of what other features are enabled for other crates.

As far as I can tell this corresponds to gating the calls to profile_open_frame (etc) in the interpreter loop under the gas_profiler_feature_enabled macro but from what I can tell looking at the macros, this corresponds to basically double-feature-gating the code which doesn't seem like it's needed?

I agree, it isn't needed. I will remove them, since if the macros are not even compiled when the flag is off, then that would mean there is no performance optimization to avoid calling and immediately returning from a function.

Jan 23 '24 06:01 lavindir

Clean-ups in #15939

Jan 26 '24 02:01 amnn

sui sui copied to clipboard

add profile-transaction to CLI

Description

Type of Change (Check all that apply)

Release notes

sui
sui copied to clipboard