Add pipeline statistics
## Objective
It's useful to have access to render pipeline statistics, since they provide more information than FPS alone. For example, the number of drawn triangles can be used to debug culling and LODs, and the number of fragment shader invocations can serve as a more stable metric than GPU elapsed time.
See also: Render node GPU timing overlay #8067, which doesn't provide pipeline statistics, but adds a nice overlay.
## Solution

- Add a `RenderStatistics` resource, which provides CPU and GPU elapsed time as well as pipeline statistics (primitive count, shader invocations, etc.) for each render pass.
- Add `MeasuredRenderPass`, which derefs to a regular `RenderPass` and records timing and pipeline statistics. Note: only labeled passes are recorded.
- `TrackedRenderPass` now uses `MeasuredRenderPass` internally.
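The wrapper-with-`Deref` pattern described above can be sketched in isolation. Everything below is a simplified stand-in for illustration (the `RenderPass` here is a mock, not wgpu's type, and the timing logic is reduced to CPU elapsed time):

```rust
use std::ops::{Deref, DerefMut};
use std::time::{Duration, Instant};

// Mock stand-in for the underlying render pass type.
struct RenderPass {
    draw_calls: u32,
}

impl RenderPass {
    fn draw(&mut self) {
        self.draw_calls += 1;
    }
}

// Simplified sketch of a measuring wrapper: it derefs to the inner
// pass, so existing call sites keep working unchanged, while the
// wrapper records from construction until it is finished.
struct MeasuredRenderPass {
    inner: RenderPass,
    started: Instant,
}

impl MeasuredRenderPass {
    fn new(inner: RenderPass) -> Self {
        Self { inner, started: Instant::now() }
    }

    // Consumes the wrapper and reports elapsed CPU time.
    fn finish(self) -> Duration {
        self.started.elapsed()
    }
}

impl Deref for MeasuredRenderPass {
    type Target = RenderPass;
    fn deref(&self) -> &RenderPass { &self.inner }
}

impl DerefMut for MeasuredRenderPass {
    fn deref_mut(&mut self) -> &mut RenderPass { &mut self.inner }
}

fn main() {
    let mut pass = MeasuredRenderPass::new(RenderPass { draw_calls: 0 });
    pass.draw(); // resolved through DerefMut
    pass.draw();
    assert_eq!(pass.draw_calls, 2);
    let elapsed = pass.finish();
    println!("recorded 2 draws in {elapsed:?}");
}
```

The design choice is that measurement stays transparent: code written against `RenderPass` keeps compiling against the wrapper thanks to deref coercion.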
## TODO
- Maybe store spans instead of elapsed time.
- Disable statistics when the required features aren't available.
## Changelog

- Add `RenderStatisticsPlugin`
- Add `RenderStatistics` resource
- Update `3d_scene` example to print statistics to console
Welcome, new contributor!
Please make sure you've read our contributing guide and we look forward to reviewing your pull request shortly ✨
> Disable statistics when the required features aren't available.
Can you say more about this? It sounds like we should block on fixing that before merging this PR to avoid unexpected problems, but I'm not fully sure I understand.
I have addressed that. Now, when timestamp queries are unsupported, no GPU timing will be recorded; the same goes for pipeline queries. CPU timings are always recorded, though, since they don't require any features.
Also, I have written the docs.
I have the following question: should I keep the `RenderStatistics` resource, or should the statistics be stored as diagnostics instead? I would imagine having diagnostics like `render.main_opaque_pass_3d.gpu_time`. This would give us smoothing basically for free. But I see a few problems with this approach:
- How to generate diagnostic UUIDs, given there can be any number of render passes (at least one for each shadow caster, for instance)
- How to assign diagnostic instants? They can either share one instant at which they became available, or I could try to approximate instants given GPU timestamps and the instant when the frame was submitted.
- Missing ergonomics for nested diagnostics (a render pass would have, like, 7 separate diagnostics, and there's no way to easily iterate through all of them).
I would say it's simpler to leave render statistics as a resource, and a future PR could address copying those into diagnostics and/or providing a GUI overlay, like in #8067.
I've gated all recording functionality behind RenderStatisticsPlugin, which is disabled by default.
Additionally, I now reuse buffers instead of creating them every frame. There are 2 query sets, and 2 buffers (one for resolving the query set, the other for CPU readback) per frame in flight (of which there are 3, apparently).
This makes performance of statistics recording basically negligible (given there are only a couple render passes). I ran many_sprites benchmark and saw no measurable regressions.
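The buffer-reuse scheme above can be illustrated with a minimal ring of per-frame slots. This is a stand-alone sketch, not the PR's implementation: the frames-in-flight count of 3 comes from the comment above, and plain `Vec<u64>` stands in for the real GPU buffers:

```rust
// Hypothetical sketch: each frame in flight gets its own readback slot,
// so the GPU can still be writing frame N while the CPU reads an older
// frame's results. Slots are reused, never reallocated.
const FRAMES_IN_FLIGHT: usize = 3;

struct QuerySlots {
    // In the real implementation these would be GPU buffers; a plain
    // Vec<u64> stands in for the resolved query data here.
    slots: [Vec<u64>; FRAMES_IN_FLIGHT],
    frame: usize,
}

impl QuerySlots {
    fn new() -> Self {
        Self { slots: Default::default(), frame: 0 }
    }

    // Returns the slot to record into this frame.
    fn current_mut(&mut self) -> &mut Vec<u64> {
        let i = self.frame % FRAMES_IN_FLIGHT;
        &mut self.slots[i]
    }

    fn end_frame(&mut self) {
        self.frame += 1;
    }
}

fn main() {
    let mut q = QuerySlots::new();
    for f in 0..6 {
        let slot = q.current_mut();
        slot.clear(); // reuse instead of reallocating
        slot.push(f as u64);
        q.end_frame();
    }
    // After six frames, each of the three slots was reused twice and
    // holds the data of the most recent frame that wrote to it.
    assert_eq!(q.slots[0], vec![3]);
    assert_eq!(q.slots[1], vec![4]);
    assert_eq!(q.slots[2], vec![5]);
}
```

The modulo indexing is what keeps allocation out of the per-frame path, which is why the recording overhead becomes negligible.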
Don't think we'll have time to finish the impl and get this reviewed on top of all the other rendering work going on atm, so I'm moving this to the 0.13 milestone. Would still love to get this in eventually though.
@LeshaInc this looks neat but has been dormant for a while. Would you like me to put this work up for adoption?
This was mostly blocked on #9266; now that it's merged, I could honestly finish it myself in the near future. All that's left is integrating the new nested diagnostics, resolving merge conflicts, and adding statistics for the new passes.
Awesome, I'm glad to hear it :) Ping me if you need anything else.
@LeshaInc let me know if you need any help on this. I'd love to get not just pipeline statistics, but also timing info.
For now a raw diagnostics output should be the minimal implementation the PR should handle, but in the future it'd be nice to compute a histogram or some sort of smoothed output over multiple frames, as I did in https://github.com/bevyengine/bevy/pull/8067.
I also think it's worth looking into wgpu-profiler. It gives us tracy integration (iirc, and it would be awesome to see both CPU/GPU timing in tracy at once), and handles a lot of the implementation details already.
I've updated everything to store pipeline statistics & timing info in diagnostics, instead of a separate resource. All you need to enable gathering all of this info is to add RenderDiagnosticsPlugin and LogDiagnosticsPlugin.
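The setup described above would look roughly like this. This is a sketch, not code from the PR: it assumes the `RenderDiagnosticsPlugin` name from this PR together with the existing `LogDiagnosticsPlugin` from `bevy_diagnostic`:

```rust
use bevy::diagnostic::LogDiagnosticsPlugin;
use bevy::prelude::*;
// Assumed module path for the plugin added by this PR.
use bevy::render::diagnostic::RenderDiagnosticsPlugin;

fn main() {
    App::new()
        .add_plugins((
            DefaultPlugins,
            // Records per-pass timings and pipeline statistics.
            RenderDiagnosticsPlugin,
            // Periodically prints all diagnostics to the console.
            LogDiagnosticsPlugin::default(),
        ))
        .run();
}
```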
Here's how the diagnostics look (`3d_scene` example):

```text
render/main_opaque_pass_3d/clipper_invocations : 74.000000 (avg 74.000000)
render/main_opaque_pass_3d/clipper_primitives_out : 74.000000 (avg 74.000000)
render/main_opaque_pass_3d/elapsed_cpu : 0.062421ms (avg 0.065678ms)
render/main_opaque_pass_3d/elapsed_gpu : 0.237975ms (avg 0.233948ms)
render/main_opaque_pass_3d/fragment_shader_invocations : 403062.000000 (avg 403062.000000)
render/main_opaque_pass_3d/vertex_shader_invocations : 92.000000 (avg 92.000000)
render/shadows/elapsed_cpu : 1.017897ms (avg 1.053214ms)
render/shadows/elapsed_gpu : 0.049878ms (avg 0.049476ms)
render/shadows/shadow pass point light 0 -x/clipper_invocations : 62.000000 (avg 62.000000)
render/shadows/shadow pass point light 0 -x/clipper_primitives_out : 2.000000 (avg 2.000000)
render/shadows/shadow pass point light 0 -x/elapsed_cpu : 0.065137ms (avg 0.067303ms)
render/shadows/shadow pass point light 0 -x/elapsed_gpu : 0.004264ms (avg 0.004489ms)
render/shadows/shadow pass point light 0 -x/fragment_shader_invocations: 0.000000 (avg 0.000000)
render/shadows/shadow pass point light 0 -x/vertex_shader_invocations : 68.000000 (avg 68.000000)
render/shadows/shadow pass point light 0 -y/clipper_invocations : 74.000000 (avg 74.000000)
render/shadows/shadow pass point light 0 -y/clipper_primitives_out : 74.000000 (avg 74.000000)
render/shadows/shadow pass point light 0 -y/elapsed_cpu : 0.065375ms (avg 0.066253ms)
render/shadows/shadow pass point light 0 -y/elapsed_gpu : 0.004936ms (avg 0.004821ms)
render/shadows/shadow pass point light 0 -y/fragment_shader_invocations: 0.000000 (avg 0.000000)
render/shadows/shadow pass point light 0 -y/vertex_shader_invocations : 92.000000 (avg 92.000000)
```
(I've stripped irrelevant info and sorted the lines, since LogDiagnosticsPlugin currently logs them in hashmap order.)
As you can see, there's info for all render passes, including elapsed CPU (recording) and GPU time, and pipeline statistics. They can be easily accessed from DiagnosticsStore programmatically, if needed.
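The `value (avg ...)` pairs in the output above come from smoothing over a history of recent frames. Here is a minimal stand-alone sketch of that kind of smoothing; the window size and type names are arbitrary assumptions for illustration, not `bevy_diagnostic`'s actual internals:

```rust
use std::collections::VecDeque;

// Hypothetical smoothing buffer, similar in spirit to how a
// diagnostics store might keep a bounded history of measurements.
struct Smoothed {
    history: VecDeque<f64>,
    max_len: usize,
}

impl Smoothed {
    fn new(max_len: usize) -> Self {
        Self { history: VecDeque::new(), max_len }
    }

    // Push a new measurement, evicting the oldest when full.
    fn push(&mut self, value: f64) {
        if self.history.len() == self.max_len {
            self.history.pop_front();
        }
        self.history.push_back(value);
    }

    fn latest(&self) -> Option<f64> {
        self.history.back().copied()
    }

    fn average(&self) -> Option<f64> {
        if self.history.is_empty() {
            return None;
        }
        Some(self.history.iter().sum::<f64>() / self.history.len() as f64)
    }
}

fn main() {
    let mut gpu_time = Smoothed::new(20);
    for v in [0.237, 0.231, 0.234] {
        gpu_time.push(v);
    }
    // Same shape as the log lines above: latest value, then average.
    println!(
        "render/main_opaque_pass_3d/elapsed_gpu : {:.6}ms (avg {:.6}ms)",
        gpu_time.latest().unwrap(),
        gpu_time.average().unwrap()
    );
}
```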
To record diagnostics in render nodes, you have to:

1. Get the diagnostics recorder from `RenderContext`:

   ```rust
   let diagnostics = render_context.diagnostic_recorder();
   ```

   It will be an opaque type implementing the `RenderDiagnostics` trait. Its methods will be no-ops unless `RenderDiagnosticsPlugin` is present. This is done to avoid writing `#[cfg(feature = "diagnostics")]` everywhere.

2. Begin the span:

   ```rust
   let time_span = diagnostics.time_span(render_context.command_encoder(), "shadows");
   ```

   Here the command encoder can also be replaced with a render pass or a compute pass.

3. End the span. You have to provide the same encoder which was used to begin the span:

   ```rust
   time_span.end(render_context.command_encoder());
   ```
There are two kinds of spans:
- Time spans can be used both in top-level encoders and in render/compute pass encoders. They only collect elapsed time.
- Pass spans can only be used inside passes, and they provide more info, such as the number of shader invocations. Their hardware support is more limited, so when they are unsupported, only CPU time will be recorded.
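The begin/end discipline above (in particular, ending a span on the same encoder it began on) can be sketched in a self-contained way. The types here are mock stand-ins for illustration, not the PR's actual API:

```rust
use std::time::{Duration, Instant};

// Mock stand-in for a command encoder, identified by id so the
// sketch can check that a span ends on the encoder it began on.
struct CommandEncoder {
    id: u32,
}

// A time span records only elapsed time, mirroring the description
// above; the real version would write GPU timestamps as well.
struct TimeSpan {
    encoder_id: u32,
    started: Instant,
}

impl TimeSpan {
    fn begin(encoder: &CommandEncoder) -> Self {
        Self { encoder_id: encoder.id, started: Instant::now() }
    }

    // Consuming `self` makes ending the span twice a compile error.
    fn end(self, encoder: &CommandEncoder) -> Duration {
        assert_eq!(
            self.encoder_id, encoder.id,
            "span must end on the encoder it began on"
        );
        self.started.elapsed()
    }
}

fn main() {
    let encoder = CommandEncoder { id: 1 };
    let span = TimeSpan::begin(&encoder);
    // ... record commands here ...
    let elapsed = span.end(&encoder);
    println!("shadows span took {elapsed:?}");
}
```

Making `end` consume the span is a small API nicety: forgetting to end a span is still possible, but ending it twice or on the wrong encoder is caught.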
Now I'd like to address a few points.
@JMS55
> For now a raw diagnostics output should be the minimal implementation the PR should handle, but in the future it'd be nice to compute a histogram or some sort of smoothed output over multiple frames, as I did in https://github.com/bevyengine/bevy/pull/8067.
I agree, a visual overlay would be nice. I think it should work for all diagnostics, not just render ones. I imagine some kind of tree view, where you can expand individual diagnostics and look at their plots, histograms, etc. However, bevy_ui is still far from being capable of such complex UI. I think an unofficial visualizer built on top of bevy_egui could be made as a temporary solution.
> I also think it's worth looking into wgpu-profiler. It gives us tracy integration (iirc, and it would be awesome to see both CPU/GPU timing in tracy at once), and handles a lot of the implementation details already.
I agree it's best for the ecosystem to use an established crate. However, wgpu-profiler doesn't record pipeline statistics, which is what I was interested in to begin with. I see a few ways to move forward:
- Keep render profiling inside bevy. The internals are less than 1000 lines long, including the integration with bevy_diagnostic. A future PR could add tracy integration, which shouldn't be too hard.
- Keep the same abstract API, but replace the implementation with wgpu-profiler (could be done in a future PR). This requires adding support for pipeline statistics to it, which would take some time to get reviewed, merged, and released.
In both cases, I think it's worth it for Bevy to have its own API for recording rendering spans, integrated with TrackedRenderPass, bevy_diagnostic, and other shenanigans. The underlying implementation could always be changed without breaking the API.
Nice progress: seems very useful. Is this ready to come out of draft now?
One consideration is we would want to lock the GPU to base memory/clock speed in order to get stable profiling.
Other than that, once this PR is no longer in draft, I'd be happy to review it and hopefully get this merged early into the 0.14 development cycle :)
@LeshaInc 0.13 is done, let's work on this now :). Can you rebase this? Once done, I'll review it.
@JMS55 done
Should this live in bevy_dev_tools? @alice-i-cecile
(not to derail the review - that can be done before moving files)
I think so: it's valuable for discovery, and to ensure that other diagnostics can be built based on the public API.
Btw I'm unfortunately busy with IRL stuff, sorry for the delay on reviewing this. I still want to get this in as soon as I can.
@LeshaInc if you can get CI passing I'll merge this in for you :) I think you may need to adjust some imports.