sui
sui copied to clipboard
add profile-transaction to CLI
Description
Adds a command to the CLI for profiling a transaction. The command will generate the profile, and the user will then be able to open it using a flamegraph viewer of choice. Considering that the user may have a different preference of flamegraph tool already installed, and that using speedscope is not specifically required, I would prefer to leave speedscope (and node on which it relies) out of the "prerequisites for sui", especially since it is not a true prerequisite. Another thing I considered here is that a common use pattern will be to make iterative changes and compare profiles, so a user will want to run profile-transaction [digest] -> code changes -> profile-transaction [digest] -> open both files at the same time and compare. Thus leaving these two steps of generating and viewing the profile separate would be helpful for those cases.
I have added a configuration value to enable the profiler to the existing VMProfilerConfig. For now, the profiler can still additionally be enabled via env var method to allow for easy enablement when running tests.
./target/debug/sui client profile-transaction --tx-digest AbfBmKgsx4SaMY3XuNgqZ1DAxD5riLvos6B4hhaMb8y8
will output a profile to the current working directory whereas there is also an option to specify a filepath for the profile output such as
./target/debug/sui client profile-transaction --tx-digest AbfBmKgsx4SaMY3XuNgqZ1DAxD5riLvos6B4hhaMb8y8 -- /Users/laura/temp/my_profile.json
If your changes are not user-facing and not a breaking change, you can skip the following section. Otherwise, please indicate what changed, and then add to the Release Notes section as highlighted during the release process.
Type of Change (Check all that apply)
- [ ] protocol change
- [ ] user-visible impact
- [ ] breaking change for a client SDKs
- [ ] breaking change for FNs (FN binary must upgrade)
- [ ] breaking change for validators or node operators (must upgrade binaries)
- [ ] breaking change for on-chain data layout
- [ ] necessitate either a data wipe or data migration
Release notes
The latest updates on your projects. Learn more about Vercel for Git ↗︎
Name | Status | Preview | Comments | Updated (UTC) |
---|---|---|---|---|
mysten-ui | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Jan 24, 2024 6:09am |
sui-core | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Jan 24, 2024 6:09am |
sui-typescript-docs | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Jan 24, 2024 6:09am |
3 Ignored Deployments
Name | Status | Preview | Comments | Updated (UTC) |
---|---|---|---|---|
explorer | ⬜️ Ignored (Inspect) | Visit Preview | Jan 24, 2024 6:09am | |
multisig-toolkit | ⬜️ Ignored (Inspect) | Visit Preview | Jan 24, 2024 6:09am | |
sui-kiosk | ⬜️ Ignored (Inspect) | Visit Preview | Jan 24, 2024 6:09am |
ready for re-review
Also, I second @tzakian's question on whether it's feasible to add something to ensure
sui-node
is never built with the profiler enabled.
Yes, I added a warning if it the feature is enabled when running the validator node binary, but we can make a strong guarantee by having it panic instead
The main question I have is for the Cargo.toml files I see a number of places where the fatures are both manually set for the crate being imported and also enabled based on a feature in the crate. My understanding of Cargo features is you should just need to enable them at the bottom (and otherwise they're always turned on I think? But I need to read up on this more).
The declaration at the bottom creates a new feature for the current crate, which is required for every crate that we use
#[cfg(feature = "gas-profiler")]
. I am happy to go into the configuration in more details, because I know it is really confusing
@amnn The best way to convince yourself that it works is to try it
None of this is well documented, and took weeks of trial and error
cargo install --locked --git https://github.com/MystenLabs/sui.git --branch laura/add_gas_profiler_to_cli --features gas-profiler sui
then
sui client profile-transaction --tx-digest AbfBmKgsx4SaMY3XuNgqZ1DAxD5riLvos6B4hhaMb8y8
TL;DR: The feature gate on the CLI is working, but profiling during testing is broken, the error!
in sui-node
's main
is currently silent, and the profiler code is being enabled still in the Move crates because of hakari.
So I pulled down the PR and tried some things. Here's what I found:
After the most recent push, the feature gating is working better for the sui
CLI. Prior to it, it was possible to run the gas profiler, even without the feature enabled. This is related to removing the feature tags on the dependencies (so it's driven purely by one feature enabling another. (Note that there are still some straggler cases where dependencies have the feature explicitly enabled).
Profiling a test run is not working (example run shared below). This is related to the issue I mentioned in an earlier comment (the default file path being empty does not work, see the error in the output from the run):
sui$ cargo build --bin sui --features gas-profiler
sui$ cd crates/sui-framework/packages/sui-framework
sui-framework$ env MOVE_VM_PROFILE=1 ~/sui/target/debug/sui move test
... [snip] ...
Unable to create file: Os { code: 2, kind: NotFound, message: "No such file or directory" } panic.file="external-crates/move/crates/move-vm-profiler/src/lib.rs" panic.line=187 panic.column=49
The issue that caused you to make profiler: Option<GasProfiler>
not a gated part of sui-types
is a symptom of a larger issue, which is that ~hakari hates you, and wishes you ill~ the move crates in workspace-hack still have the gas-profiler
feature enabled always, which means that even though you are carefully constructing this tree of gas-profiler
feature flags, hakari is merging all potential uses of them together, which is undoing your hard work.
I believe this is happening because they are considered to be external crates, and so hakari is merging all their usages.
Additionally, hakari is too clever, and it has spotted that when certain crates within our workspace are compiled with all features enabled, we enable the gas-profiler
feature in move crates, so that's why it's being enabled.
The fix would seem to be to incorporate the move crates into our workspace, which was something that @bmwill was looking into. Then our tree of feature flags should do the right thing (I'm inferring this from the fact that sui-types
is not included in workspace-hack
, and so doesn't have this problem). There may also be something you could do with the hakari config (sui/.config/hakari.toml
)? But that's speculation on my part and I'm less certain about that than the "include in the workspace" strategy. I only suggest it because it might be a quicker way to get unblocked for now than to wait for the workspace inclusion work to continue.
Because of the above, I was surprised that sui-node
seemed to be running fine despite the check you had added (if that hypothesis was correct, then we should be triggering that error). This was because of two issues:
-
error!
is a tracing message, which means you need to initialise telemetry before we start to see log messages. This is done later on inmain
than your error message. - In any case, we probably want something more aggressive than an error message, which can be missed. I tried changing the
error!
into apanic!
(see below, I needed to wrap it inif true { ... }
because otherwise the compiler complains that the rest of the function is unreachable), and indeed, this code path gets hit when I run:
sui$ cargo run --bin sui-node -- --help
move_vm_profiler::gas_profiler_feature! {
if true {
panic!("Cannot run the sui-node binary with gas-profiler feature enabled");
}
}
So the macro approach to checking whether the feature exists is working well, but we need to strengthen the test.
I have made the fixes and added a test. @amnn thank you so much for those finds! There was one un-resolved comment thread about the fact that I removed the feature gating from the attributes that have implementations in other crates, had given it another go to add the feature gating there in commit add back feature gating to profiler as struct attribute but then I ran into similar build errors like this https://github.com/MystenLabs/sui/actions/runs/7246236235/job/19737768634, which leads me to believe we should avoid that kind of feature gating that relies on features in other crates being enabled. Also tried adding a second function implementation under #[cfg(not(feature = "gas-profiler"))] but the trait implementation was expecting just one function definition.
cargo tree -p sui-node -f "{p} {f}"
now shows that there is no gas-profiler feature anywhere, and if I run the sui-node binary locally it doesn't panic.
The removal of the gating of the trait function members get_profiler and set_profiler will not cause any performance regression because all calls to these two functions remain gated. The profiler struct member above is only accessed via these two functions.
This comment made me think about the fact that in the debug-build gated present state, we were previously not gating any of the macro calls into the profiler from the interpreter. I have gone ahead and gated those as well, which should cause a performance improvement since those calls are the ones that are getting looped-over several times.
I haven't been apart of the discussions on profiling but I'm very hesitant about the path that is being taken here wrt using cargo feature flags. Feature flags tend to work really well for single, stand alone libraries but tend to be much harder to work with when you have a feature that spans numerous crates in a workspace (and in this case across workspaces). This is made worse by the fact that this particular feature is not designed to work with best practices around cargo features, in that features must be additive and that enabling a feature in one crate should not break compilation, or functionality, in another.
I agree with you that this was very hard to get working. The feature is additive in that it will not break compilation or functionality if it is partially enabled, and I have done extensive testing around this to verify both that the feature adheres to the additive best practice, and that if something in the feature tree were to break, the worst case scenario is that the profiler would stop working.
Another large concern of mine is testability. Since compiled code can't be reused, we can't do a single build in CI to test any functionality gated by this feature. This code would likely end up being left untested. We already are doing two full compilations and test runs due to MSIM, if we wanted to test this functionality what would be the plan?
I previously had the sui crate dev dependency enable the feature for testing, which would allow us to run a connectivity test for the feature tree. However, given that we are only doing a single build and want to stick with that, we can set up an actions script that tests builds with the feature enabled and runs this test once every release cycle to make sure we don't release a version of the CLI that has a broken profiler. I would rather maintain that the CI to build the version without the feature enabled. Thoughts on this?
My assumption is that the reason it was decided to use cargo features for this is because we're worried about performance. Have we run any performance evaluations with this logic compiled in and disabled, compiled in and enabled, and compiled out to see what the effect is on performance? In other words is there an material impact on performance with this compiled in but disabled that warrants us going down this path of relying on cargo features for conditional compilation? There's a number of things that can be done to reduce the impact on performance that would likely be easier to maintain than using conditional compilation.
Given the amount of importance we place on performance, having extra data recorded on every move call is not something in my mind that we would choose to accept. I don't have concrete numbers but it would cost O(n) where n is the number of function calls in the transaction. If we look at a few profiles, we can see that this is not a trivial cost.
Thanks for taking a look at this PR @bmwill, your concerns are definitely not unfounded, and we also didn't come to the decision of using features lightly, because we knew this would be the source of some current and ongoing pain.
From my perspective, there are two main reasons why I'm keen to have as little (or none) of the GasProfiler in sui-node
:
- One is performance, but rather than some desire to maintain some known performance metric, it comes more from a place where we know that the performance of the interpreter is not as good as it could be now, and I don't want that to become a slippery slope towards it being bad forever. I'm sure in this case we could get the cost of the profiler down to a single well-predicted branch for every Move function call -- it's unlikely we would notice that, but that's largely because a call is already quite expensive today, which is something we want to change. The addition of a branch instruction to calls is something that you would typically notice in any other interpreter.
- The other is preserving the TCB. Here again, I know that we don't have a thorough firewall around what is and isn't in the TCB, but I'm particularly nervous about introducing something, non-trivial, that was previously only enabled in debug builds, and is related to gas, to the authority codebase. In general, we have a different bar for local tooling vs what goes in the authority codebase.
Additionally, for me, gas-profiling is just the tip of the spear in terms of tools that might require instrumenting the VM for local tooling in a way that we don't want to preserve in the authority codebase (debuggers, tracers, memory profilers, REPLs are all other examples).
However, given that we are only doing a single build and want to stick with that, we can set up an actions script that tests builds with the feature enabled and runs this test once every release cycle to make sure we don't release a version of the CLI that has a broken profiler. I would rather maintain that the CI to build the version without the feature enabled. Thoughts on this?
I like this approach, mainly because it means the "default" configuration that gets tested the most is the one that preserves the TCB.
Re: Buck2 (context -- I used to work on Buck and the migration from Buck v1 to Buck v2 for mobile builds!). I suspect that what we are trying to do here would work much more smoothly in Buck. I would model it as something akin to a build flavour (release
, debug
, test
), which can be toggled using configurations:
- We would define a custom build rule, say
sui_crate
that listens to a configuration that controls whether the target is being built for the TCB or not. - Each crate is responsible for controlling its own build set-up based on these configurations, so the gas profiler could choose whether it was enabled or not based on this
tcb
configuration -- no more cascade of feature flags. - When we build
sui-node
we would build under therelease
andtcb
configuration. - When we build
sui
we would build under therelease
configuration, but nottcb
. - Buck should be able to correctly identify which targets builds change in response to these configs, to maximise re-use of the cache.
- We can choose whether we test with
tcb
enabled or not, or both, given the effects to caching/incrementality should then be much more manageable.
I have removed the feature addition to the CLI build as we agreed above, and started a thread in the PE channel for setup of a different task to run the profiler test
fwiw, @amnn is correct re buck2 - we can make it work. I think select()
is the magic that makes it work, but tbh, i'm not sure. I have only used select for platform flavors and not this specific use case.
Agreed! Although I'm wondering if there's any loss to keeping these feature gated? I'd prefer to keep them feature-gated just from a clarity perspective if it's not too hard. But that being said, if it adds difficulty totally OK to keep them ungated.
After doing some investigation on the limitations and recommended usages of features, I have come to the conclusion that these functions cannot be gated. This is because they are part of a trait, which is defined in a different crate than the structs that implement them. If we gate this, during some part of the build process we receive a build error, with either "this function is not part of the trait" or "trait is missing implementation of these functions". The correct usage for features specifies that there should be both no build or logical errors if any single feature is enabled for a crate, regardless of what other features are enabled for other crates.
As far as I can tell this corresponds to gating the calls to profile_open_frame (etc) in the interpreter loop under the gas_profiler_feature_enabled macro but from what I can tell looking at the macros, this corresponds to basically double-feature-gating the code which doesn't seem like it's needed?
I agree, it isn't needed. I will remove them, since if the macros are not even compiled when the flag is off, then that would mean there is no performance optimization to avoid calling and immediately returning from a function.
Clean-ups in #15939