switch from async to rayon [v3]

Open Stebalien opened this issue 7 months ago • 6 comments

This switches from async rust to using rayon for conformance test parallelism. I'm making this PR against FVM v3 because we currently have conformance tests there.
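
Roughly, the shape of the change looks like this (a minimal sketch with hypothetical names, not the PR's actual code): instead of spawning one async-std task per test vector and awaiting them all, the vectors are driven by a rayon parallel iterator.

```rust
// Sketch only; hypothetical names, not the PR's actual code.
use rayon::prelude::*;

struct TestVector; // stand-in for a parsed conformance test vector

fn run_vector(_v: &TestVector) {
    // execute a single conformance test against the FVM
}

fn run_all(vectors: &[TestVector]) {
    // rayon fans the vectors out across its worker threads,
    // replacing the old per-test async task spawning
    vectors.par_iter().for_each(run_vector);
}
```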

Motivation:

  • Primary: Remove the need for maintainers to understand/work with complex async code.
  • Secondary: Remove async-std (deprecated).

Performance:

  • Startup performance for the conformance tests is significantly slower (something to do with locking when we compile the built-in actors from multiple threads?).
  • Runtime performance appears to be the same.

Overall, the conformance tests go from 6 to 12 seconds, which isn't great (2x), but that extra time appears to be entirely "startup" cost and shouldn't scale with the number of tests.

fixes #2144

Stebalien avatar Apr 22 '25 14:04 Stebalien

@rvagg if this isn't easier to understand, it's not worth it. I thought it was going to be simpler, but error handling with rayon was surprisingly difficult. It might be better to re-try this with channels and a thread-pool, although error handling will still be tricky.

My issue with async-await is that it comes with a bunch of sharp edges around moving things and the error messages can be cryptic.

Stebalien avatar Apr 22 '25 14:04 Stebalien

This all makes sense to me and is much simpler, but I'm not seeing the error handling difficulty this introduces. Can you highlight that for me? It looks like normal rust shenanigans to my eyes. Is it the need for Counters that's the hassle?

The biggest delta is that, looking at the new code, I'm having to take for granted that using a MultiEngine with the specified parallelism actually delivers the parallelism we want, whereas with async/await you can see it because it's local. And the performance hit is a concern, but you're verifying somehow that it's all in startup? Are you seeing that in logging output? If we can confirm that it's startup cost, then maybe we're not properly memoising the multi-threaded compile, and that sounds like an issue we can deal with separately to this; if so, I'm happy with this change.

rvagg avatar May 08 '25 19:05 rvagg

This all makes sense to me and is much simpler, but I'm not seeing the error handling difficulty this introduces. Can you highlight that for me? It looks like normal rust shenanigans to my eyes. Is it the need for Counters that's the hassle?

It's normal rust shenanigans. My hope was that rayon would have "try" versions of all their operations that could fail the pipeline early, but no such luck.
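
To make the "shenanigans" concrete, here is a minimal sketch (hypothetical names, not the PR's actual code) of the counter-based approach the review is asking about: instead of short-circuiting the parallel iterator on the first error, failures are tallied with atomic counters and checked at the end.

```rust
// Sketch only: tally pass/fail with atomic counters across a rayon
// parallel iterator instead of aborting the pipeline on the first error.
use rayon::prelude::*;
use std::sync::atomic::{AtomicUsize, Ordering};

fn run_one(vector: &str) -> Result<(), String> {
    // stand-in for running a single conformance test vector
    if vector.is_empty() {
        Err("empty vector".to_string())
    } else {
        Ok(())
    }
}

fn main() {
    let vectors = vec!["a", "", "b"];
    let passed = AtomicUsize::new(0);
    let failed = AtomicUsize::new(0);

    vectors.par_iter().for_each(|v| match run_one(v) {
        Ok(()) => {
            passed.fetch_add(1, Ordering::Relaxed);
        }
        Err(e) => {
            failed.fetch_add(1, Ordering::Relaxed);
            eprintln!("test failed: {e}");
        }
    });

    println!(
        "passed: {}, failed: {}",
        passed.load(Ordering::Relaxed),
        failed.load(Ordering::Relaxed)
    );
    if failed.load(Ordering::Relaxed) > 0 {
        std::process::exit(1);
    }
}
```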

The other way to do this is to manually spin up a worker pool and feed the workers with channels (go style). But I'm not sure if that'll be any better.
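
For comparison, a rough sketch of that "go style" alternative using plain threads and an mpsc channel (again hypothetical, not code from this PR):

```rust
// Rough sketch of a channel-fed worker pool as an alternative to rayon.
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

fn run_one(_job: u32) -> Result<(), String> {
    // stand-in for executing a single conformance test
    Ok(())
}

fn main() {
    let (tx, rx) = mpsc::channel::<u32>();
    // std's Receiver can't be shared between threads directly, so guard it with a Mutex.
    let rx = Arc::new(Mutex::new(rx));

    let n_workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(4);
    let workers: Vec<_> = (0..n_workers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut failures = 0usize;
                loop {
                    // take the next job; stop once the sender is dropped
                    let job = match rx.lock().unwrap().recv() {
                        Ok(job) => job,
                        Err(_) => break,
                    };
                    if run_one(job).is_err() {
                        failures += 1;
                    }
                }
                failures
            })
        })
        .collect();

    for job in 0..100 {
        tx.send(job).unwrap();
    }
    drop(tx); // close the channel so the workers exit

    let total_failures: usize = workers.into_iter().map(|w| w.join().unwrap()).sum();
    println!("failures: {total_failures}");
}
```

Error handling has the same wrinkle here: each worker still has to report its failures back (via the join handle or another channel), so it doesn't come out much cleaner than the rayon version.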

The biggest delta is that, looking at the new code, I'm having to take for granted that using a MultiEngine with the specified parallelism actually delivers the parallelism we want, whereas with async/await you can see it because it's local.

I'm not sure I understand:

  1. Before, we constructed a global multi-engine with a specified parallelism.
  2. Now we construct a local multi-engine with the parallelism auto-detected by rayon (usually the number of CPU threads).
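
Concretely for point 2, something like the following (a sketch; treat the exact MultiEngine constructor signature as an assumption rather than a confirmed API):

```rust
// Sketch: size the wasm engine pool to rayon's worker-thread count, which
// rayon auto-detects (usually the number of CPU threads). The exact
// MultiEngine constructor signature is an assumption here.
use fvm::engine::MultiEngine;

fn make_engines() -> MultiEngine {
    // rayon::current_num_threads() reports the current pool's size, which
    // defaults to the number of available CPU threads.
    let concurrency = rayon::current_num_threads() as u32;
    MultiEngine::new(concurrency)
}
```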

And the performance hit is a concern, but you're verifying somehow that it's all in startup? Are you seeing that in logging output?

Yeah, I verified it by logging. We get stuck compiling wasm modules then blow past once we're done.

If we can confirm that it's startup cost, then maybe we're not properly memoising the multi-threaded compile, and that sounds like an issue we can deal with separately to this; if so, I'm happy with this change.

I dug into this more and the issue is that:

  1. Wasmtime is also using rayon to parallelize compilation of single modules.
  2. We're blocking all of the threads in the rayon thread pool waiting on wasmtime to finish compiling the wasm modules.

This is apparently a well-known rayon issue with no good fix. I've tried some nasty yield hacks but... I only managed to shave off 2 seconds when I should be able to shave off 6.
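
For context, a "yield hack" along these lines (purely illustrative, not necessarily what was tried; it assumes rayon >= 1.7 for rayon::yield_now): run the blocking compile on a plain thread and keep yielding the rayon worker to other queued tasks while polling for the result.

```rust
// Illustrative only: avoid parking a rayon worker on a blocking wasm compile
// by running the compile on its own thread and yielding while we wait.
use std::sync::mpsc;
use std::thread;

fn compile_module(bytes: Vec<u8>) -> Result<usize, String> {
    // stand-in for a blocking wasmtime compile
    Ok(bytes.len())
}

fn compile_without_parking_rayon(bytes: Vec<u8>) -> Result<usize, String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let _ = tx.send(compile_module(bytes));
    });
    loop {
        match rx.try_recv() {
            Ok(result) => return result,
            Err(mpsc::TryRecvError::Empty) => {
                // Let this rayon worker run other queued work instead of blocking.
                // Outside a rayon pool, yield_now() returns None, so fall back to
                // a plain OS yield to avoid spinning hot.
                if rayon::yield_now().is_none() {
                    thread::yield_now();
                }
            }
            Err(mpsc::TryRecvError::Disconnected) => {
                return Err("compile thread died".to_string());
            }
        }
    }
}

fn main() {
    let size = compile_without_parking_rayon(vec![0u8; 16]).unwrap();
    println!("compiled module of {size} bytes");
}
```

Even then, the worker can end up busy-polling whenever there's nothing else queued, which may be part of why the hack only recovers a fraction of the lost time.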

Let's chat about this in person. I think we may just need to shelve this for now.

Stebalien avatar May 08 '25 22:05 Stebalien

@rvagg : did you get to chat in person on this? What are the next steps?

BigLep avatar May 20 '25 05:05 BigLep

Yeah, I thought I approved it; the only concern was the time it takes to run it.

I think we may just need to shelve this for now.

I'm fine either way with this. It doesn't look like there's any progress on the rayon issue, so maybe that'll go nowhere. I probably need to run this locally to experience the slowdown and see how big a deal it is. I think we mostly run these in CI, so the extra few seconds probably isn't a big deal.

rvagg avatar May 20 '25 09:05 rvagg

Note: If we merge this, it'll need to be forward-ported to v4 (this is against the v3 branch). I'm ambivalent.

My main concern with it as-is is rayon just doesn't deal with blocking very well.

Stebalien avatar May 20 '25 14:05 Stebalien

2025-06-25 maintainer conversation: @rvagg will try running it himself and do a gut check. If we proceed, he will then take on the v4 work.

BigLep avatar Jun 25 '25 22:06 BigLep

Testing locally with cargo test --package fvm_conformance_tests, we go from ~42s for release/v3 to ~140s for this branch, which is quite a lot more. Then again, it's already slow enough that people aren't likely to run it regularly locally and will instead rely on CI to get it done, so maybe this isn't a big deal?

rvagg avatar Jun 30 '25 03:06 rvagg

Attempting an alternative here: https://github.com/filecoin-project/ref-fvm/pull/2180 (builds on this commit but winds back some of it so we can just use tokio)

rvagg avatar Jun 30 '25 05:06 rvagg

Handled in https://github.com/filecoin-project/ref-fvm/pull/2180

rvagg avatar Jul 07 '25 06:07 rvagg