ref-fvm
[Investigation] Reported performance regression when updating wasmtime from 19 to 24.
We've seen a reported performance regression that appears to be associated with the wasmtime update that happened in FVM 4.4 (wasmtime 19 -> 24). This was an interesting performance regression because:
- It shows up even when updating FVM 2.7 to 2.9, even when the updated FVM isn't actually being used, just included in the build. This indicates that the issue is in some common wasmtime dependency.
- None of the gas benchmarks in FVM v4 are picking up the regression, but those benchmarks mostly focus on non-wasm things. Importantly, I'm reasonably sure that:
  - It's not related to hashing.
  - It's not related to signature verification.
  - It's not related to calling into wasm contracts (see #2057).
  - It's not related to trapping and running out of gas (we have an explicit benchmark for this because it occasionally rears its head and turns into an $N^2$ problem).
  - It shouldn't be cranelift. That tends to be version-locked with a specific wasmtime version.
Things I haven't checked:
- The cost of wasm instructions in general. We priced wasm instructions long ago and it was a bit of a manual process.
- The cost of jumping in and out of wasm. Wasmtime has made some recent changes here; however, I'd expect any issues there to be specific to the wasmtime version in use (observation (1) above appears to conflict with this). A rough way to measure this overhead is sketched after this list.
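For context, here is a minimal standalone sketch of how one might get a rough number for host↔wasm call overhead with wasmtime, outside the FVM entirely. The module, export name, and iteration count are illustrative only; it assumes the `wasmtime` and `anyhow` crates, with wasmtime's default `wat` feature enabled.

```rust
// Illustrative sketch: time repeated calls into a trivial wasm export to get a
// rough per-call host<->wasm transition cost. Not the FVM's actual benchmark.
use std::time::Instant;
use wasmtime::{Engine, Instance, Module, Store};

fn main() -> anyhow::Result<()> {
    let engine = Engine::default();
    // A trivial module exporting a no-op function (text format via the `wat` feature).
    let module = Module::new(&engine, r#"(module (func (export "noop")))"#)?;
    let mut store = Store::new(&engine, ());
    let instance = Instance::new(&mut store, &module, &[])?;
    let noop = instance.get_typed_func::<(), ()>(&mut store, "noop")?;

    let start = Instant::now();
    for _ in 0..1_000_000 {
        noop.call(&mut store, ())?;
    }
    println!("rough per-call overhead: {:?}", start.elapsed() / 1_000_000);
    Ok(())
}
```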
All I can think of now is... rustix? Or something like that?
My next step is to try to update FVM v2 in forest to see what deps get updated along with it.
We've confirmed that having wasmtime v24 anywhere in the build tree, even if unused, causes this slowdown.
- It's not the `write_core` feature in `object`, as far as I can tell.
- I'm also seeing some changes in gimli, but v24 uses a different version of that, so it shouldn't matter.
Interesting, I'm seeing smallvec now depending on serde. I can't imagine how that might be relevant, but I'll need to check.
I've tracked it down to wasmtime's cranelift feature (it does not appear to be feature unification, but I may be missing something). I'm now trying with just wasmtime-cranelift.
- wasmtime-environ doesn't reproduce.
- cranelift-wasm does.
cranelift-codegen reproduces; trying cranelift-control now.
cranelift-control does not. Trying cranelift-codegen-meta, cranelift-codegen-shared, and cranelift-isle.
Ok, it is cranelift-codegen specifically. Now I'm trying to see if it's because we're not supporting some architecture by building with CARGO_FEATURE_ALL_ARCH.... we'll see if that even works.
Er, trying with reduced features first. Only "std" and "unwind" because those are required.
Ok, still broken with those features. Now I'm trying with "all arch" to see if it's some kind of ISLE issue (I think it is, but I'm not sure if that's the fix).
Enabling all architectures doesn't help, and I can't skip the ISLE build (that option only exists if it's pre-built). Now I'm bisecting cranelift-codegen, starting with 0.109.
Ok, it's cranelift-codegen 0.107.0 exactly. Now I'm testing a build with all the FVM crates fully updated to get us all on a single wasmtime version just in case it's an issue with multiple wasmtime versions.
Ok, it is feature unification: specifically, the trace-log feature getting enabled on regalloc2. (Cargo unifies features for a given crate version across the whole build graph, so once anything in the tree enables that feature, every crate sharing the same regalloc2 gets it.)
My hypothesis is that this is actually just a compilation slowdown, not an execution slowdown. We're only running through ~24 epochs here and will have to pause to lazily (IIRC?) compile actors every time we load a new one. This is my guess because I'm noticing that epochs 1 & 4 are taking a while.
Additionally, if we're speeding through multiple network upgrades (especially if we're switching FVM versions), we'll have to re-compile the actors per network version.
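If that's right, the cost should show up in wasmtime's ahead-of-time compile step rather than in execution. A rough sketch of separating the two is below; the path is a placeholder and this is not the FVM's actual actor-loading code.

```rust
// Illustrative only: time the one-time cranelift compile (Module::new)
// separately from reloading a precompiled artifact, which skips cranelift.
use std::time::Instant;
use wasmtime::{Engine, Module};

fn main() -> anyhow::Result<()> {
    let engine = Engine::default();
    let wasm = std::fs::read("actor.wasm")?; // placeholder path

    let t = Instant::now();
    let module = Module::new(&engine, &wasm)?; // full cranelift compile
    println!("compile: {:?}", t.elapsed());

    let artifact = module.serialize()?;
    let t = Instant::now();
    // Safe only for artifacts we produced ourselves with the same engine config.
    let _cached = unsafe { Module::deserialize(&engine, &artifact)? };
    println!("reload precompiled: {:?}", t.elapsed());
    Ok(())
}
```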
@Stebalien This could be verified by running a longer benchmark without network upgrades. The .env file would need to be modified along the lines of:
LOTUS_IMAGE=ghcr.io/chainsafe/lotus-devnet:2024-10-10-600728e
FOREST_DATA_DIR=/forest_data
LOTUS_DATA_DIR=/lotus_data
FIL_PROOFS_PARAMETER_CACHE=/var/tmp/filecoin-proof-parameters
MINER_ACTOR_ADDRESS=f01000
LOTUS_RPC_PORT=1234
LOTUS_P2P_PORT=1235
MINER_RPC_PORT=2345
FOREST_RPC_PORT=3456
FOREST_OFFLINE_RPC_PORT=3457
F3_RPC_PORT=23456
F3_FINALITY=100000
GENESIS_NETWORK_VERSION=24
SHARK_HEIGHT=-10
HYGGE_HEIGHT=-9
LIGHTNING_HEIGHT=-8
THUNDER_HEIGHT=-7
WATERMELON_HEIGHT=-6
DRAGON_HEIGHT=-5
WAFFLE_HEIGHT=-4
TUKTUK_HEIGHT=-3
TARGET_HEIGHT=200
Note that the timeout would also need to be extended. Then, we compare the timings before and after the FVM upgrade.
That way, we will be able to confirm whether the slowdown is coming from re-compiling actors (which is not a big issue).
@Stebalien I believe your hypothesis holds. I tried with the config above locally (on a machine that previously reported a 50% slowdown).
FVM 4.3.1
________________________________________________________
Executed in 781.81 secs fish external
usr time 5.92 secs 0.00 micros 5.92 secs
sys time 5.95 secs 696.00 micros 5.95 secs
FVM 4.4
________________________________________________________
Executed in 782.58 secs fish external
usr time 5.81 secs 0.00 micros 5.81 secs
sys time 6.01 secs 811.00 micros 6.01 secs
So I guess it's "okay-ish" in the sense that the slowdown only occurs under some synthetic conditions. That said, it might make sense to report it to the cranelift maintainers.
Yep, I plan on reporting it upstream. I'm also looking into possibly disabling the offending feature (requires upstream help, but I'm guessing it's only needed for GC which we don't use).
Also, for forest, you can modify the "quick" compilation profile to optimize cranelift-codegen (or maybe just regalloc2); Cargo supports per-package profile overrides, e.g. a `[profile.quick.package.cranelift-codegen]` section with a higher `opt-level`. That didn't get it to be quite fast enough to pass the test in the 5m timeout, but it definitely improved the situation.
I'm also going to write a quick "load/compile the actors" benchmark.
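Something along these lines, presumably; a hypothetical criterion skeleton where the wasm path and bench name are placeholders, not the actual FVM benchmark harness:

```rust
// Hypothetical skeleton for a "load/compile the actors" benchmark: measures
// the cranelift compile step for a single actor wasm blob. Placeholder paths.
use criterion::{criterion_group, criterion_main, Criterion};
use wasmtime::{Engine, Module};

fn compile_actors(c: &mut Criterion) {
    let engine = Engine::default();
    // Placeholder: in practice, iterate over the actor bundle's wasm blobs.
    let wasm = std::fs::read("actors/actor.wasm").expect("read actor wasm");
    c.bench_function("compile_actor", |b| {
        b.iter(|| Module::new(&engine, &wasm).expect("compile"));
    });
}

criterion_group!(benches, compile_actors);
criterion_main!(benches);
```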
Oh, lol, no. This trace-log is literally just for debugging. I don't think this was supposed to get shipped to production.
> I don't think this was supposed to get shipped to production.

Nvm:

> # Note that this just enables `trace-log` for `clif-util` and doesn't turn it on
> # for all of Cranelift, which would be bad.

It looks like they thought this wouldn't have a performance impact.
Ah, that comment is saying that trace-log is only enabled for clif-util.
And... already fixed but not in the version we're using: https://github.com/bytecodealliance/wasmtime/pull/9128. Time to update to v25, I guess.