mach mach: linux/vulkan needs frame rate limiter to avoid severe issues with some distro+WM+nvidia combinations

We've had two reports of this type of behavior on (Linux, Vulkan, X11, NVIDIA, and less common distros), where we request no vsync (I believe?) and instead of the frame rate being smoothed, instead e.g. 60 frames are rendered in a burst almost instantaneously, then a vulkan swapchain call pauses for an entire second rendering nothing, then another burst, and so on:

https://user-images.githubusercontent.com/3173176/182849928-c562cc4f-206d-4d66-ac39-8bb0ebe2fc00.mp4

All instances appear to be solved by adding a frame rate limiter, e.g. opening any example application and adding std.time.sleep(16 * std.time.ns_per_ms); to the top of pub fn update

The reports:

adayoldbagel on Discord:
- Distro: Arch
- X11/Wayland: X11
- Window manager: bspwm, no desktop environment
- Vulkan driver: proprietary nvidia 515.57
- GPU: discrete RTX 2070 SUPER
- Using GPU_BACKEND=opengl works around the issue? Yes
- Using frame rate limiter solves the issue? Yes
boopinski on Discord:
- Distro: lubuntu
- X11/Wayland: X11
- Window manager: LXDE
- Vulkan driver: proprietary nvidia 510.73.05
- GPU: GTX 750 Ti
- Using GPU_BACKEND=opengl works around the issue? Yes
- Using frame rate limiter solves the issue? Yes
bitflip on Zig Discord:
- xwayland + sway
- Disabling vsync fixed the issue
- Facing it in their own GLX usage (not webgpu/Mach)

I've found that all example programs using vulkan tie up a CPU core to 100% and lock me out of all input for 10 or more seconds per frame. I've tracked the behaviour down to the Swapchain.getCurrentTextureView() call. It seems to just hang for ages

Aug 04 '22 12:08 emidoots

You can test this by opening any one of the examples and dropping at the top of the pub fn update function a call to std.time.sleep(16 * std.time.ns_per_ms);

Jan 02 '23 20:01 emidoots

Implications for Mach core

Because this issue affects how Linux applications function properly, this fits into the scope of Core not just Engine. If someone fails to use a frame rate scheduler on Linux, their application may not work on very many Linux systems - and part of our promise with Mach core is that it should just work well on most Linux systems.

In practice this means that Mach core must have a Timer which measures frame rate timing. I think the only corrective action we should take from the perspective of Core is: measure frame rate timing, identify if the platform is Linux and this bug is occurring in their driver/etc. If it is, then warn the user about this (log it / set a flag somewhere) and then employ frame rate scheduling via a frame rate limiter.

Frame timing information that Engine (and possibly Core?) should expose to users

delta_time (float and ns integral) - the number you should e.g. multiply movement by in order to get frame-rate-independent animation.
frame_time (ns integral) - the amount of time the last frame took, end-to-end. If you were to print something each frame in your application, this would be the time between those print statements.
frame_sleep_time (ns integral) - what portion of frame_time was spent sleeping due to frame rate limiting, if enabled.
frame_gpu_time (ns integral) - what portion of frame_time was spent by the GPU doing work
- TODO: this is tricky information to acquire as it must exclude vsync, need to think through how this will be measured. Probably a Dawn timestamp query is the best way.
app_frame_Rate (integral) - updated every 250ms, if you were to print something each frame in your application the amount of times it was printed since the last measurement * 4.
- This will have a warning that it may be limited by the frame rate limiter, and vsync (on macOS, even if vsync is disabled as there are OS-specific restrictions to how quickly an application may render frames into a Metal queue.)
- Notably this is named app_fps not just fps because: is application frame rate really what you are trying to measure? Even with the fact that it might include vsync time on macOS if vsync is off?
gpu_frame_Rate (integral) - updated every 250ms, this is 250ms divided by frame_gpu_time, i.e. if there was no sleep or vsync at play, then how fast would my application be going?
- I think this makes sense because I believe many people just want a "how fast is my app" number, without thinking about the details of frame rate limiting, vsync, OS restrictions, etc. GPU frame rate is the best indicator of this, communicated in a "FPS" number.

Functionally if we have all this information we can also query the amount of time that vsync likely took via a frameVSyncTime() function which returns end_to_end_time - frame_sleep_time - frame_gpu_time, as that is all that remains. This should include a note that vsync time may not be 0 on macOS even if it is disabled, as there are OS-specific restrictions to how quickly an application may render frames into a Metal queue.

Detecting the Linux bug

The Linux bug described in this issue occurs when:

vsync is enabled
frameVSyncTime() reports 0 (or close to zero) for most frames in a bundle (e.g. 60, 120, 244, etc.) except for one frame (the last one in that bundle.)

If we sample over say ~3s and find 3 frames with very large vsync times, and the remainder having little to no vsync times, then we have detected the bug with a great deal of confidence statistically.

Correcting the Linux bug

Once detected, we need to determine the effective refresh rate of the application without any frame rate limiter. This is 1 second divided by frame_time - frame_sleep_time.

Once we know that information, we set the frame rate limit to be that effective refresh rate - so that our frames are scheduled to be submitted periodically throughout a full second.

The user should be able to override this detection if they desire, but how exactly this is done requires some thought. Maybe an explicit option to disable it.

Frame rate limiting

When a frame rate limiter is set (either by detection of the Linux bug, or by the user calling the API to set the frame rate limit to some constant number) we need to determine the right amount of time to sleep and/or spinlock to get to our desired frame rate.

High-precision sleep is often unavailable, and spinlock is expensive, so what is needed is a dynamic approach which updates every 1/4th of a second:

If the current app_frame_rate is greater than target_frame_rate + sleep_precision_margin, then increment sleeps or spinlocks towards difference (running too fast is better than too slow.)
If the current app_frame_rate is lesser than target_frame_rate, decrease sleeps or spinlocks to "guarantee" the next frame should not be lesser.
Otherwise, do nothing.

When we sleep or spinlock (measure time using high-precision timer) we must over time build up our understanding of their behavior, which may change throughout the application's execution, over a few seconds. Sleeping a short time may not actually be so short, so a combination of sleeps + spinlocks are necessary. The exact amount should be incrementally worked towards until we reach the target frame rate.

Jan 15 '23 18:01 emidoots

OK but actually this discussion/plan is not complete without talking about macOS behavior, too.

on macOS, there is a Metal command queue in Dawn that can fill up. It also appears that Metal does some sort of runtime frame rate detection at application startup.

Here is some concrete observations.

Frame rate is not limited if window not on screen

If you switch to a different virtual screen, the application runs at an unlimited frame rate:

FPS: 60
FPS: 61
FPS: 60
FPS: 61
FPS: 61
FPS: 60
FPS: 153
FPS: 14618
FPS: 14805
FPS: 14787
FPS: 14906
FPS: 14786
FPS: 14820
FPS: 14805
FPS: 14806
FPS: 14811
FPS: 14762
FPS: 14832
FPS: 14833
FPS: 3365
FPS: 61
FPS: 60
FPS: 60
FPS: 61

This indicates a sane frame rate limit is a must for macOS, perhaps making it default to the detected frame rate under normal circumstances (but would need to avoid it detecting "offscreen" as the normal circumstances!)

Startup delay

When starting on macOS with vsync enabled, the frame rate appears unlimited at the start of the application and then "stabilizes" to the actual refresh rate. This can happen very quickly (here each line represents 1 second of time passing):

FPS: 9415
FPS: 61
FPS: 61
FPS: 60
FPS: 61
FPS: 61
FPS: 61

As you can see ~9k frames were rendered in 1 second, then frame submission was limited to the 60hz refresh rate after that.

Importantly, this can happen for a much longer period of time initially if your frames take longer to render / do more work. For example when the deferred-rendering example starts up, we can see that 2560 frames are rendered at 264-316 FPS over 9s at startup before a stable 60hz is reached:

FPS: 316
FPS: 282
FPS: 275
FPS: 281
FPS: 280
FPS: 279
FPS: 249
FPS: 334
FPS: 264
FPS: 61
FPS: 61
FPS: 60
FPS: 61
FPS: 61
FPS: 61

This is also why deferred-rendering looks so janky on startup on macOS, as it doesn't take into account framerate-independent animation when moving the lights. This can go on for upwards of 20s (until that Metal command queue buffer fills up, I presume?):

https://user-images.githubusercontent.com/3173176/212568020-1163db50-dbb6-43b1-abc7-84e82eccdb17.mov

(ignore any FPS indicated in the video above)

How I measured

These measurements were all taken by recording the number of frames that pass in 1s, printing the number and resetting it to zero once 1s of time has elapsed:

var frame_count: usize = 0;

pub fn update(app: *App) !bool {
    app.delta_time = app.timer.lap();
    frame_count += 1;
    if (app.window_title_timer.read() >= 1.0) {
        std.debug.print("FPS: {d}\n", .{frame_count});
        frame_count = 0;
        app.window_title_timer.reset();
    }

Jan 15 '23 21:01 emidoots

Note to self: may also be impacted by GLFW bugs like https://github.com/glfw/glfw/issues/1267 on Windows with borderless fullscreen windows

Jan 28 '23 06:01 emidoots

macOS behavior was fixed by https://github.com/hexops/dawn/commit/6cf893f90f4698cf45cf6601bcb034939f86e51f#diff-b6357fb71c710f4219ce7419aee5c4d6af8afd22254c810a1ab5fe01a10298cc it seems. With vsync .none option I get consistent >1k fps with textured-cube example.

EDIT: actually, it may have been fixed by the multi-threaded rendering change.

Jul 22 '23 05:07 emidoots

decision: before imposing a frame rate limiter on all our Linux users by default, we're going to wait and see if anyone can reproduce this behavior again. If you can, please do comment.

Our hope/belief is that (a) some Dawn changes or (b) the multi-threaded rendering support we added (always on) may have resolved the poor vsync behavior.

Jul 28 '23 03:07 emidoots

I've been able to re-produce this behavior (on 9250310) -- using the textured cube example and "Getting started", though I assume every mach causes this. Speed up and slow-down of the cube spinning, framerate goes between 22, 61, 181, etc. My window manager and entire system become quite unresponsive, and switching window focus is very slow and makes it worse, causing all windows to become unresponsive.

I attempted to record this, but quickly realized it's actually pretty difficult because my encoder quickly fails when trying to do so, and OBS will stop recording if the encoder gets too far behind. A CPU encoder might work if I try it later.

I tried setting MACH_GPU_BACKEND=opengl to see if that fixes it, but that causes a panic:

nfo(mach): found Vulkan backend on Discrete GPU adapter: NVIDIA GeForce RTX 2070 SUPER, NVIDIA: 525.116.04 525.116.4.0

info(mach): gamemode: activated
error(mach): mach: device lost: CreateSwapChain failed with VK_ERROR_NATIVE_WINDOW_IN_USE_KHR
 - While handling unexpected error type Internal when allowed errors are (Validation|DeviceLost).
    at CheckVkSuccessImpl (/home/runner/work/mach-gpu-dawn/mach-gpu-dawn/libs/dawn/src/dawn/native/vulkan/VulkanError.cpp:88)
    at Initialize (/home/runner/work/mach-gpu-dawn/mach-gpu-dawn/libs/dawn/src/dawn/native/vulkan/SwapChainVk.cpp:344)
    at Create (/home/runner/work/mach-gpu-dawn/mach-gpu-dawn/libs/dawn/src/dawn/native/vulkan/SwapChainVk.cpp:250)
    at CreateSwapChain (/home/runner/work/mach-gpu-dawn/mach-gpu-dawn/libs/dawn/src/dawn/native/Device.cpp:1779)

thread 584035 panic: mach: device lost
/home/james/mach-core/src/platform/native/Core.zig:128:5: 0xc49d9b in deviceLostCallback (textured-cube)
    @panic("mach: device lost");
    ^
???:?:?: 0x60719e in ??? (???)
???:?:?: 0x61c419 in ??? (???)
???:?:?: 0x623c1a in ??? (???)
Unwind error at address `:0x623c1a` (error.UnimplementedUserOpcode), trace may be incomplete

/home/james/zig-linux-x86_64-0.12.0-dev.1092+68ed78775/lib/std/Thread.zig:412:13: 0xccb721 in callFn__anon_43598 (textured-cube)
            @call(.auto, f, args);
            ^
/home/james/zig-linux-x86_64-0.12.0-dev.1092+68ed78775/lib/std/Thread.zig:685:30: 0xc681c2 in entryFn (textured-cube)
                return callFn(f, args_ptr.*);
                             ^
???:?:?: 0x7efcb63ac9ea in ??? (libc.so.6)

I can supply much more information here, or create a new issue, just let me know what's most important and where I should put it.

Nov 26 '23 00:11 drunderscore