
Maybe pipelined rendering

hymm opened this pull request 2 years ago • 10 comments

Objective

  • Implement pipelined rendering
  • Fixes #5082

User Facing Description

Bevy now implements pipelined rendering! Pipelined rendering allows the app logic and the rendering logic to run on different threads, leading to large gains in performance.

[image: tracy capture of the many_foxes example]

To use pipelined rendering, you just need to add the PipelinedRenderingPlugin. If you're using DefaultPlugins, it will automatically be added for you on all platforms except wasm; Bevy does not currently support the multithreading on wasm that this feature needs. If you aren't using DefaultPlugins, you can add the plugin manually:

use bevy::prelude::*;
use bevy::render::pipelined_rendering::PipelinedRenderingPlugin;
use bevy::render::RenderPlugin;

fn main() {
    App::new()
        // whatever other plugins you need
        .add_plugin(RenderPlugin)
        // needs to be added after RenderPlugin
        .add_plugin(PipelinedRenderingPlugin)
        .run();
}

If for some reason pipelined rendering needs to be removed, you can disable the plugin in the normal way:

use bevy::prelude::*;
use bevy::render::pipelined_rendering::PipelinedRenderingPlugin;

fn main() {
    App::new()
        .add_plugins(DefaultPlugins.build().disable::<PipelinedRenderingPlugin>())
        .run();
}

A setup function was added to plugins

An optional plugin lifecycle function was added to the Plugin trait. This function is called after all plugins have been built, but before the app runner is called, which allows some final setup to be done. In the case of pipelined rendering, the function removes the sub app from the main app and sends it to the render thread.

struct MyPlugin;
impl Plugin for MyPlugin {
    fn build(&self, app: &mut App) {
        // normal plugin configuration goes here
    }
    
    // optional function
    fn setup(&self, app: &mut App) {
        // do some final setup before runner is called
    }
}

A Stage for Frame Pacing

In the RenderExtractApp there is a stage labelled BeforeIoAfterRenderStart that systems can be added to. The specific use case for this stage is a frame pacing system that can delay the start of main app processing in render-bound apps to reduce input latency, i.e. "frame pacing". This is not currently built into bevy, but exists as the third-party bevy_framepace plugin. A sketch of adding a system to this stage follows the diagram below.

|-------------------------------------------------------------------|
|         | BeforeIoAfterRenderStart | winit events | main schedule |
| extract |---------------------------------------------------------|
|         | extract commands | rendering schedule                   |
|-------------------------------------------------------------------|
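As a rough sketch, adding a frame pacing system to this stage could look like the following. This assumes RenderExtractApp and BeforeIoAfterRenderStart are importable from bevy::render::pipelined_rendering; the paths and the frame_pacing_sleep system are illustrative, not confirmed by this PR.

use bevy::prelude::*;
// Assumed import paths, for illustration only.
use bevy::render::pipelined_rendering::{BeforeIoAfterRenderStart, RenderExtractApp};

// Hypothetical frame pacing system: sleeping here delays the start of the
// next main-app frame (and the winit event processing that precedes it).
fn frame_pacing_sleep() {
    // e.g. sleep until the target frame time has elapsed
}

fn main() {
    let mut app = App::new();
    app.add_plugins(DefaultPlugins);
    app.sub_app_mut(RenderExtractApp)
        .add_system_to_stage(BeforeIoAfterRenderStart, frame_pacing_sleep);
    app.run();
}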

Small API additions

  • Schedule::remove_stage
  • App::insert_sub_app
  • App::remove_sub_app
  • TaskPool::scope_with_executor
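As an example of the sub app additions above, pipelined rendering takes ownership of the render sub app each frame and puts it back afterwards. A minimal sketch, assuming remove_sub_app returns an Option and insert_sub_app takes the label plus the sub app (approximated from this PR description, not exact signatures):

use bevy::prelude::*;
use bevy::render::RenderApp;

fn hand_off_render_app(app: &mut App) {
    // Take the render sub app out of the main app...
    if let Some(render_app) = app.remove_sub_app(RenderApp) {
        // ...here it could be sent to the render thread and run...
        // ...then inserted back once rendering is done.
        app.insert_sub_app(RenderApp, render_app);
    }
}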

Problems and Solutions

Moving render app to another thread

Most of the hard bits for this were done with the render redo. This PR just sends the render app back and forth through channels, which seems to work OK. I originally experimented with using a scope to run the render task. It was cuter, but that approach didn't allow rendering to start before i/o processing, so I switched to using channels. There is much higher complexity in the coordination that needs to be done, but it's worth it: by running rendering during i/o processing, frame times should be much more consistent in render-bound apps. See https://github.com/bevyengine/bevy/issues/4691. A minimal sketch of the channel approach follows.
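Here is a minimal, self-contained sketch of the channel "ping-pong", using plain std channels and a stand-in RenderApp type; Bevy's actual implementation uses its own channel and task types:

use std::sync::mpsc::channel;
use std::thread;

// Stand-in for the real render sub app.
struct RenderApp;
impl RenderApp {
    fn run_schedule(&mut self) { /* rendering happens here */ }
}

fn main() {
    let (to_render, from_main) = channel::<RenderApp>();
    let (to_main, from_render) = channel::<RenderApp>();

    // Long-running render task: receive the app, render, send it back.
    thread::spawn(move || {
        while let Ok(mut render_app) = from_main.recv() {
            render_app.run_schedule();
            to_main.send(render_app).unwrap();
        }
    });

    let mut render_app = RenderApp;
    for _frame in 0..3 {
        // extract data from the main world into the render app here, then
        // hand the render app off to the render thread...
        to_render.send(render_app).unwrap();
        // ...process winit events and run the main schedule while rendering
        // happens in parallel, then wait for the render app to come back.
        render_app = from_render.recv().unwrap();
    }
}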

Unsoundness with Sending World with NonSend resources

Dropping !Send things on a thread other than the one they were spawned on is considered unsound. The render world doesn't have any non-send resources, so if we tell users to "pretty please don't spawn non-send resources on the render world", we can avoid this problem.

More seriously, there is the https://github.com/bevyengine/bevy/pull/6534 PR, which patches the unsoundness by aborting the app if a non-send resource is dropped on the wrong thread. That PR should probably be merged before this one. For a longer-term solution, we have this discussion going: https://github.com/bevyengine/bevy/discussions/6552.

NonSend Systems in render world

The render world doesn't have any !Send resources, but it does have a non-send system. While Window is Send, winit has some APIs that can only be accessed on the main thread, so prepare_windows in the render schedule needs to be scheduled on the main thread. Currently we run non-send systems on the thread the TaskPool::scope runs on; once render moves to another thread, this no longer works.

To fix this, a new scope_with_executor method was added that takes an optional ThreadExecutor that can only be ticked on the thread it was initialized on. The render world holds a MainThreadExecutor resource, which is passed to the scope used by the parallel executor so that non-send systems are spawned onto the main thread. A stand-in sketch of the idea follows.
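A minimal stand-in for the MainThreadExecutor idea, using plain std channels rather than Bevy's actual ThreadExecutor type: any thread can queue work, but only the main thread ever runs it, so queued tasks are effectively pinned to the main thread.

use std::sync::mpsc::{channel, Sender};
use std::thread;

type MainThreadTask = Box<dyn FnOnce() + Send>;

fn main() {
    let (sender, receiver) = channel::<MainThreadTask>();

    // A "render thread" that needs work done on the main thread, the way
    // prepare_windows needs main-thread-only winit APIs.
    let to_main: Sender<MainThreadTask> = sender.clone();
    let render_thread = thread::spawn(move || {
        to_main
            .send(Box::new(|| println!("this runs on the main thread")))
            .unwrap();
    });

    // Drop the original sender so the loop ends once all clones are gone.
    drop(sender);
    // Only the main thread ticks the queue, so tasks always run here.
    while let Ok(task) = receiver.recv() {
        task();
    }
    render_thread.join().unwrap();
}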

Scope executors for render and main should not share tasks

The render world and the app world share the ComputeTaskPool. Because scope ticks the ComputeTaskPool's executors, a system from the main world could run on the render thread, or a render system could run on the main thread. This can cause performance problems, because it can delay a stage from finishing. See https://github.com/bevyengine/bevy/pull/6503#issuecomment-1309791442 for more details.

To avoid this problem, TaskPool::scope has been changed to not tick the ComputeTaskPool when it's used by the parallel executor. In the future, when we move closer to the one-thread-per-logical-core model, we may want to overprovision threads, because the render and main app threads don't do much while executing the schedule.

Performance

My machine: Windows 11, AMD Ryzen 5 5600X, RX 6600.

Examples

This PR with pipelining vs Main

Seeing a perf gain ranging from roughly 42% on many foxes down to roughly 6% on many sprites.

| tracy frame time | percent mean | percent median | percent sigma | diff mean (ms) | diff median (ms) | diff sigma (ms) | Main mean (ms) | Main median (ms) | Main sigma (ms) | PR mean (ms) | PR median (ms) | PR sigma (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| many foxes | 41.70% | 42.12% | 27.81% | 6.33 | 6.25 | 3.09 | 15.18 | 14.84 | 11.11 | 8.85 | 8.59 | 8.02 |
| many lights | 29.68% | 29.25% | 29.85% | 4.41 | 4.27 | 1.94 | 14.86 | 14.6 | 6.5 | 10.45 | 10.33 | 4.56 |
| many animated sprites | 15.38% | 15.23% | -3.06% | 4.51 | 4.39 | -0.14 | 29.33 | 28.82 | 4.58 | 24.82 | 24.43 | 4.72 |
| 3d scene | 11.64% | 10.69% | 18.21% | 0.78 | 0.7 | 0.59 | 6.7 | 6.55 | 3.24 | 5.92 | 5.85 | 2.65 |
| many cubes | 8.16% | 7.66% | 3.33% | 1.83 | 1.69 | 0.32 | 22.43 | 22.06 | 9.62 | 20.6 | 20.37 | 9.3 |
| many sprites | 5.61% | 6.56% | -11.06% | 1.5 | 1.73 | -0.47 | 26.73 | 26.36 | 4.25 | 25.23 | 24.63 | 4.72 |

This PR with pipelining disabled vs Main

Not sure I 100% understand these changes, but they're possibly due to two things. First, there is one less thread executing systems' compute tasks now that the main thread doesn't execute them. Second, the scope executor changed away from try_tick.

| tracy frame time | percent mean | percent median | percent sigma | diff mean (ms) | diff median (ms) | diff sigma (ms) | Main mean (ms) | Main median (ms) | Main sigma (ms) | PR no pipelining mean (ms) | PR no pipelining median (ms) | PR no pipelining sigma (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| many foxes | 1.78% | 1.62% | -2.25% | 0.27 | 0.24 | -0.25 | 15.18 | 14.84 | 11.11 | 14.91 | 14.6 | 11.36 |
| many lights | -0.61% | -1.03% | 16.92% | -0.09 | -0.15 | 1.1 | 14.86 | 14.6 | 6.5 | 14.95 | 14.75 | 5.4 |
| many animated sprites | 1.77% | 1.60% | -10.48% | 0.52 | 0.46 | -0.48 | 29.33 | 28.82 | 4.58 | 28.81 | 28.36 | 5.06 |
| 3d scene | -1.64% | -2.75% | -19.14% | -0.11 | -0.18 | -0.62 | 6.7 | 6.55 | 3.24 | 6.81 | 6.73 | 3.86 |
| many cubes | -1.29% | -1.63% | -2.70% | -0.29 | -0.36 | -0.26 | 22.43 | 22.06 | 9.62 | 22.72 | 22.42 | 9.88 |
| many sprites | 2.02% | 1.90% | -3.06% | 0.54 | 0.5 | -0.13 | 26.73 | 26.36 | 4.25 | 26.19 | 25.86 | 4.38 |

Benchmarks

TODO

Migration Guide

SubApp runner has conceptually been changed to an extract function.

The runner is no longer in charge of running the sub app schedule. Its only concern is moving data between the main world and the sub app; sub_app.app.schedule is now run for you after the provided function is called.

// before
fn main() {
    let mut sub_app = App::empty();
    sub_app.add_stage(MyStage, SystemStage::parallel());

    App::new().add_sub_app(MySubApp, sub_app, move |main_world, sub_app| {
        extract(main_world, sub_app);
        sub_app.app.schedule.run();
    });
}

// after
fn main() {
    let mut sub_app = App::empty();
    sub_app.add_stage(MyStage, SystemStage::parallel());

    App::new().add_sub_app(MySubApp, sub_app, move |main_world, sub_app| {
        extract(main_world, sub_app);
        // the schedule is automatically run for you after the extract function is called
    });
}

App runner and SubApp extract functions are now required to be Send

This was done to make App and SubApp Send. If this breaks your use case please report it as these new bounds might be able to be relaxed.

ToDo

  • [ ] benchmarking
  • [x] check examples perf with pipelining disabled
  • [ ] test wasm still works

hymm avatar Nov 07 '22 00:11 hymm

Fix tests. A bunch of tests are broken because tests don't run on the main thread. Need to come up with a workaround for this.

You should be able to set the number of threads the tests are run with to 1 to force this to work :)

alice-i-cecile avatar Nov 07 '22 01:11 alice-i-cecile

TIL that cfg(test) only applies to the crate currently being tested. I tried to revert to the previous behavior of scope when testing (each instance of scope has its own scope executor), but this only works for bevy_tasks when using cfg(test). The only solution I've thought of so far is to put the code behind a feature flag instead. This would mean that any time a user wants to run tests on an App or Stage, they'll need to enable the feature. This may be confusing, as the current failure mode is that the test deadlocks and times out. Maybe we could add a warning/panic for when test is enabled but the feature flag is not on.

hymm avatar Nov 07 '22 06:11 hymm

I think my plan for now is to change the MainThreadExecutor into a resource, so it's not global, and pass it into scope as an optional argument. A little less ergonomic API-wise, but probably better than asking users to turn on a feature flag for testing.

edit: also a step backwards from global task pools, but ¯\_(ツ)_/¯

hymm avatar Nov 07 '22 17:11 hymm

Would we still want a task pool scope to run the render world if !Send resources didn't exist?

Or would we just spawn a thread for the render world and park it until some signal arrives? (I would think so since ideally nothing blocks the main thread's event loop. We'd want app world running in another thread too.)

maniwani avatar Nov 09 '22 14:11 maniwani

So I changed the approach a good bit. It's now spawning a long-running task that is controlled through channels that send the rendering app back and forth. This change allows the rendering app to run during input processing and will help with more consistent frame times when the app is GPU bound.

Would we still want a task pool scope to run the render world if !Send resources didn't exist?

Not sure. Right now prepare_windows is the only thing in the render schedule that needs to run on the main thread. It's not really using a !Send resource other than as a marker to force the system to run on the main thread. I haven't looked into what it's doing to know if this could be changed somehow.

hymm avatar Nov 09 '22 22:11 hymm

When testing many_foxes I was getting really bad frame times when the animation_player (which is part of PostUpdate) ended up running on the render thread. This delays the render thread from making progress until that system is done running.

[image: tracy capture showing animation_player running on the render thread]

As a quick fix I've removed ticking the task stealing executors from scope. Seems to not hurt execution time on many_foxes and makes the tail latency much better. Below is a histogram of frame times. Yellow is before removing executor.run and red is with executor.run removed. You can see that this removes the outer peaks from the distribution with the mean remaining mostly the same.

[image: histogram of frame times, before (yellow) and after (red) removing executor.run]

This probably hurts the micro benchmarks a lot, but I still need to run them and test some other examples.

hymm avatar Nov 10 '22 05:11 hymm

Taking this out of draft as the code is mostly ready. Still need to think about making inserting NonSend resources unsafe, and to investigate perf more.

hymm avatar Nov 10 '22 07:11 hymm

Holding off on approval until we have a solid grasp on the direction of how we're dealing with !Send soundness in World.

Should we prioritize refactoring these winit interactions so they don't use !Send resources?

With the ThreadExecutor here (maybe a simpler, non-async version) and some NonSend container that stores all the !Send resources, we could proxy all !Send data accesses behind them.

Like if NonSend was stored on App, and we left it on the main thread at startup, World can be Send. And we don't have to limit systems to a particular thread, just blocks of code that read NonSend.

The main thread is just running an event loop, so systems in other threads can send FnMut(NonSend) objects to the main thread executor, and they'll be processed at some point in the loop. Doesn't even have to be async. If the main thread is just winit and we kick the "main" and render worlds to their own dedicated threads, then the event loop can run as fast as necessary.

maniwani avatar Nov 10 '22 22:11 maniwani

I'd rather us just have a dedicated thread for running each app (main and subapp), and have them all use the ComputeTaskPool for their systems/tasks.

A separate thread doesn't help the problem. The ComputeTaskPool::scope will still run systems from the other app on its thread.

hymm avatar Nov 15 '22 06:11 hymm

Removed the render task pool and changed the ComputeTaskPool scope to not tick the global executors when running the parallel executor. Perf is roughly the same.

hymm avatar Nov 15 '22 23:11 hymm

@hymm can you add Fixes #4718? The shift away from spin-ticking to using Executor::run should resolve it.

james7132 avatar Dec 04 '22 03:12 james7132

FYI some combination of the changes in this PR + the changes to scope on main reintroduced the flaky test problem in bevy_transform.

hymm avatar Dec 12 '22 20:12 hymm

So the flakiness was very similar to before. It got reintroduced by transform propagation becoming multithreaded and me removing the catch_unwind. So the parallel transform propagation in a panicking test can poison the scope used by another test. I pushed a fix where I tried to keep the executor.run's, but I'm not sure if the AssertUnwindSafe's are sound or not. Need to think about it more. Also should probably bench it just to see if it has much effect; I don't think it should, as it's outside the hot loop.

hymm avatar Dec 14 '22 04:12 hymm

Ran a quick test with many_foxes, and it seems to have improved the frame time a decent amount. Not sure what that's about.

[image: many_foxes frame time comparison]

hymm avatar Dec 15 '22 05:12 hymm

@alice-i-cecile I would still mark this as blocked until #6534 or something equivalent is merged. This implementation is unsound unless we're 100% certain World: Send.

james7132 avatar Dec 15 '22 18:12 james7132

So, more about AssertUnwindSafe. When I remove the AssertUnwindSafe, I get this note with the errors:

 = note: `UnwindSafe` is implemented for `&[closure@async_executor::Runner<'_>::runnable::{closure#0}::{closure#0}]`, but not for `&mut [closure@async_executor::Runner<'_>::runnable::{closure#0}::{closure#0}]`

but I'm not sure currently what is taking the Runner as &mut instead of &.

hymm avatar Dec 16 '22 03:12 hymm

I'm not sure if the AssertUnwindSafe's are sound or not.

Unwind safety is about not leaving something in an invalid state due to an unwind, not memory safety and undefined behavior. Given that async_executor is entirely safe code (sans the task spawning from async_task), we cannot be left in a state that could trigger UB outside of our own use of unsafe.

This generally seems OK? See https://stackoverflow.com/questions/65762689/how-can-assertunwindsafe-be-used-with-the-catchunwind-future. The issue seems to be from the FnMut closure passed to https://github.com/smol-rs/async-executor/blob/master/src/lib.rs#L659, everything else involved uses immutable references. Sans potential lock poisoning, I don't think it's possible to leave async_executor in an invalid state.
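For illustration, here is a generic std example (not this PR's code) of the pattern under discussion: AssertUnwindSafe opts captured state into catch_unwind, and &mut captures are exactly what isn't UnwindSafe by default.

use std::panic::{catch_unwind, AssertUnwindSafe};

fn main() {
    let mut counter = 0;
    // Without AssertUnwindSafe this fails to compile, because the closure
    // captures `counter` by &mut, and &mut T is not UnwindSafe.
    let result = catch_unwind(AssertUnwindSafe(|| {
        counter += 1;
        panic!("boom");
    }));
    assert!(result.is_err());
    // The state may be logically surprising after the panic, but reading it
    // is not undefined behavior.
    println!("counter = {counter}");
}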

james7132 avatar Dec 19 '22 02:12 james7132

Remember to update the PR title :)

alice-i-cecile avatar Jan 11 '23 00:01 alice-i-cecile

I wanted to check my assumption that switching to async_executor's run was still worth it over try_tick in the scope, since there have been a decent number of changes in this PR and main since I did that.

For the most part there are clear gains to using run over try_tick, except for many_cubes, which has a small gain on the mean and a small loss on the median. I'm going to dig a little into the traces to see if I can understand what's happening there, but I don't expect that to change the PR.

Negative numbers mean the current PR version was faster.

| tracy frame time | percent mean | percent median | percent sigma | diff mean (ms) | diff median (ms) | diff sigma (ms) | run scope mean (ms) | run scope median (ms) | run scope sigma (ms) | try_tick mean (ms) | try_tick median (ms) | try_tick sigma (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| many foxes | -5.39% | -6.80% | 37.05% | -0.23 | -0.28 | 2.06 | 4.27 | 4.12 | 5.56 | 4.5 | 4.4 | 3.5 |
| many lights | -5.91% | -6.77% | 25.90% | -0.43 | -0.48 | 1.51 | 7.27 | 7.09 | 5.83 | 7.7 | 7.57 | 4.32 |
| many animated sprites | -4.11% | -2.23% | 4.34% | -0.96 | -0.5 | 0.37 | 23.33 | 22.4 | 8.52 | 24.29 | 22.9 | 8.15 |
| 3d scene | -14.18% | -11.19% | -93.01% | -0.2 | -0.15 | -1.73 | 1.41 | 1.34 | 1.86 | 1.61 | 1.49 | 3.59 |
| many cubes | -0.77% | 0.57% | -19.17% | -0.11 | 0.08 | -1.48 | 14.2 | 14 | 7.72 | 14.31 | 13.92 | 9.2 |
| many sprites | -3.26% | -1.54% | 44.27% | -0.73 | -0.33 | 5.91 | 22.37 | 21.45 | 13.35 | 23.1 | 21.78 | 7.44 |

I still need to redo these for current main and with pipelining disabled. Once I get those numbers I'll update the tables in the description.

hymm avatar Jan 12 '23 18:01 hymm

Updated the PR description.

except for many_cubes which has a small gain on the mean and a small loss on the median

Most of the loss seems to be in extract and applying the commands from extract. I do get why having another scope active with try_tick is a problem though.

[image: tracy capture of the try_tick scope running tasks on the render thread]

As seen in the above image, the scope running on the render thread is running tasks as fast as prepare_systems can make them. This leads to the other threads being starved of tasks, and also slows down prepare_systems due to increased contention on the global queue (which is mostly empty). Not 100% sure why we don't see the benefit of this as much on many_cubes. Might just be things not lining up timing-wise, and extract being a little more single-threaded.

hymm avatar Jan 13 '23 18:01 hymm

(sorry my vscode got overexcited, i'm still going through)

robtfm avatar Jan 16 '23 22:01 robtfm

@aevyrie do you see anything in this PR that is strongly blocking your considerations around pacing? If you see that things can be reworked or adjusted later for that kind of work, and I believe that should be possible one way or another, then I feel the pacing concerns do not block this, but I wanted to at least ask.

@cart I feel that you should review this and for no good reason other than it being fun, you should get to push the bors button.

superdump avatar Jan 18 '23 20:01 superdump

Not critical, no. I find the way BeforeIoAfterRenderStart is added to be a bit confusing. It would make a lot more sense if this was part of the default app event loop. IMO we should remove BeforeIoAfterRenderStart if it doesn't work for non-pipelined apps, and add it in a separate PR instead.

aevyrie avatar Jan 18 '23 21:01 aevyrie

It would make a lot more sense if this was part of the default app event loop. IMO we should remove BeforeIoAfterRenderStart if it doesn't work for non-pipelined apps, and add it in a separate PR instead.

Hmm. Might be better to just let the frame pacing plugin itself add the stages then. It'll need to differ depending on whether pipelined rendering is enabled: with pipelined rendering you need to insert after the extract schedule, but with it disabled you need to add it to the end of the rendering schedule.

hymm avatar Jan 18 '23 21:01 hymm

framepacing would have to do its sleep after extract because that's technically the "beginning" of the next frame (before winit loads new input events).

Yup, this is exactly what I'm doing in bevy_framepace. It would be better if we could do it at the start of the event loop before polling, though I don't know by how much. I can look into how we can do this in another PR, though it will depend on if we intend to keep the winit and bevy update loops linked.

aevyrie avatar Jan 19 '23 17:01 aevyrie

bors r+

cart avatar Jan 19 '23 23:01 cart