Ignore `Timeout` errors on Linux AMD

Open nicopap opened this issue 1 year ago • 6 comments

Objective

Fix #3606
Fix #4579
Fix #3380

Solution

When running on a Linux machine with AMD device, when calling surface.get_current_texture(), ignore wgpu::SurfaceError::Timeout errors.

Alternative

An alternative solution found in the wgpu examples is:

let frame = surface
    .get_current_texture()
    .or_else(|_| {
        render_device.configure_surface(surface, &swap_chain_descriptor);
        surface.get_current_texture()
    })
    .expect("Error reconfiguring surface");
window.swap_chain_texture = Some(TextureView::from(frame));

See: https://github.com/gfx-rs/wgpu/blob/94ce76391b560a66e36df1300bd684321e57511a/wgpu/examples/framework.rs#L362-L370

The reason I went with this PR's solution is that configure_surface seems to be quite an expensive operation, and it would run every frame with the wgpu framework solution, despite the fact it works perfectly fine without configure_surface.

I know this looks super hacky with the linux-specific line and the AMD check, but my understanding is that the Timeout occurrence is specific to a quirk of some AMD drivers on linux, and if otherwise met should be considered a bug.

Sep 12 '22 12:09 nicopap

@mdickopp @tirithen @alexpyattaev @bobhenkel @chiboreache @paullouisageneau @popojan @etam You seem to have hit this bug, could a benevolent soul try out this patch?

Sep 12 '22 12:09 nicopap

Bevy seems to no longer crash on Wayland for me, finally can disable XWayland in orichalcum.

Sep 12 '22 14:09 heavyrain266

Tried the method from 4579, but didn't get any errors. Is there a reliable minimal way to reproduce? Running X11 on AMD XT6700 with 0.9.0-dev.

Sep 13 '22 08:09 Ixentus

I am sorry, in the meantime I switched to Wayland/sway and I cannot easily reproduce the problem anymore.

When I tried to use nicopap's revision as dependency and recompile I got an error stating cannot provide explicit generic arguments when impl Trait is used in argument position related to ::<SystemStage> usage in mod.rs, but it may well be my fault, I am new to Rust.

Sep 13 '22 13:09 popojan

This fixes both #3380 and #4579 for me. Thanks!

(Since I do not use Wayland, I cannot test #3606).

Sep 13 '22 16:09 mdickopp

Please note that I can reproduce both #3380 and #4579 on an Intel device, so I do not think they are specific to AMD devices.

2022-09-13T17:14:48.220831Z  INFO winit::platform_impl::platform::x11::window: Guessed window scale factor: 1.75    
2022-09-13T17:14:48.260219Z  INFO bevy_render::renderer: AdapterInfo { name: "Intel(R) HD Graphics 5500 (BDW GT2)", vendor: 32902, device: 5654, device_type: IntegratedGpu, backend: Vulkan }

Sep 13 '22 17:09 mdickopp

Draft until workaround expanded to intel devices. I also happen to have a intel GPU handy, so I might be able to test as well.

Sep 24 '22 12:09 nicopap

@mdickopp

Please note that I can reproduce both https://github.com/bevyengine/bevy/issues/3380 and https://github.com/bevyengine/bevy/issues/4579 on an Intel device

Hmm, can't reproduce on my Whiskey Lake intel iGPU. Looks like you are using a Broadwell, which is very common. I've a Broadwell CPU somewhere, but, at the moment I can't test on it, as the motherboard is pretty much in a cardboard box without peripherals.

Sep 26 '22 11:09 nicopap

I wonder if it wouldn't be better to ignore timeouts for all cards and drivers, but only for a specific time. That is, ignore intermittent ones:

That way we can still catch degraded application/driver state because when the driver is returning timeouts for a whole, say, second or two something very much looks amiss, at the same time delaying panic on non-problematic configurations seems benign. We can even make the timeout configurable in case some valiant gamer tries to run things on a potato or something.

Bonus: A message like "Graphics driver returned timeouts for X seconds" points end-users squarely at the issue.

Sep 27 '22 19:09 ksf

I wonder if it wouldn't be better to ignore timeouts for all cards and drivers, but only for a specific time. That is, ignore intermittent ones:

I'm interested in this solution; the less hardcoded special-casing the better.

Sep 27 '22 20:09 alice-i-cecile

@ksf For me personally, the timeout happens every frame, despite the frame clearly drawing in less than the actual timeout, so your proposed solution wouldn't work (frame draws in well below 16ms, timeout is a full second)

https://github.com/gfx-rs/wgpu/issues/1218 Says it happens "every few seconds".
https://github.com/gfx-rs/wgpu/issues/2941 logs seem to imply it happens every frame.

It might be possible to not special-case it, which is, from what I understand, how Veloren does it. I guess I was worried that I would break other assumptions. As far as I know, it shouldn't break anything, but "if it works on Linux it works on Windows" is not a sentence I've heard many times… So I kept conservative to exactly what I changed. I'd be happy to remove the #[cfg(target_os = "linux")] if someone else can confirm it works. But I'm worried another contributor will chime in soon and say "if this bug is limited to Linux, why not add #[cfg(target_os = "linux")]?" And then I'll have to write another lengthy reply justifying my decisions.

Anyway, maybe the fact we log on debug! is misleading and is what led you to believe it was intermittent? Should we entirely skip logging or log on trace! level?

Remember please that this workaround fixes a bug that prevents people from using bevy at all, so getting it in at all should be a priority, getting fancy with it can wait IMO (maybe open an issue once this is merged?)

Sep 28 '22 17:09 nicopap

@nicopap In my case the timeout happens during configuring events, things like resizing, that's why I assumed it was an intermittent issue. But on hindsight, if the user is drag-resizing the window for a second and every request times out and thus the "last successful" time can't get reset my solution would still panic. I'm not sure whether all requests time out in my case, would have to investigate (currently, at the state of development of my code (early) I said "meh" and hard-coded the present mode to AutoNoVSync).

I definitely agree that your solution is better than just crashing, and even when things get more well-behaved upstream we'll have to support older drivers (though at some point I'd say it becomes sensible to tell people to upgrade or disable VSync/eat the performance hit)

Sep 29 '22 05:09 ksf

@nicopap

Hmm, can't reproduce on my Whiskey Lake intel iGPU. Looks like you are using a Broadwell, which is very common. I've a Broadwell CPU somewhere, but, at the moment I can't test on it, as the motherboard is pretty much in a cardboard box without peripherals.

I re-tested your latest commit (a14a344) on my Intel system, and can confirm that it fixes the bugs for me.

Sep 29 '22 16:09 mdickopp

This looks controversial. I hear the reasoning. I guess ideally it would be fixed in mesa but even then it will take time for new drivers to roll out. Still, I’d like to understand it before ignoring the error.

Sep 29 '22 17:09 superdump

100% Agree. I really dislike the fact I basically don't understand what I'm doing here.

Sep 30 '22 11:09 nicopap

After some digging I found out that the Vulkan backend of wgpu calls vkAcquireNextImageKHR with a timeout of 1 second.

A comment in (an older version of) the source code of Chromium hints at a bug in X11 and how they worked around it: https://chromium.googlesource.com/chromium/src/+/8ec9935d64c1fcc72d09c2d44ac1dfc0a29514f3/gpu/vulkan/x/vulkan_surface_x11.cc#62 For one thing, they use a 2 seconds timeout. I'll do some more testing.

Sep 30 '22 13:09 mdickopp

Setting the timeout in wgpu to 2 seconds does not fix ~the issues~ issue #3380 for me. (EDIT: See more detailed explanation below.)

Sep 30 '22 13:09 mdickopp

After some digging I found out that the Vulkan backend of wgpu calls vkAcquireNextImageKHR with a timeout of 1 second.

Quoth the spec:

VK_NOT_READY is returned if timeout is zero and no image was available.

VK_TIMEOUT is returned if timeout is greater than zero and less than UINT64_MAX, and no image became available within the time allowed.

Until now I thought that the driver had some internal timeout, but it's wgpu which sets the timeout, and VK_TIMEOUT is not an error but a successful return. The driver seems to interpret the timeout rather creatively, though, returning before the timeout duration is over (or it couldn't be happening every frame).

We might be getting the behavior of setting a zero timeout and getting NOT_READY even when setting a timeout, the driver opting to return TIMEOUT in that case to be at least half-way spec compliant, not wanting to trip up programs which don't handle NOT_READY in that situation.

If that's the case then retrying for as long as the timeout is supposed to last and panicking after that should™ work out.

Sep 30 '22 14:09 ksf

I did some more testing. Turns out locking the screen (#4579) and using the import utility (#3380) cause different behavior.

While the screen is locked, the program runs with a framerate of one per second. Since the timeout is also one second, it occurs on some, but not all frames. If I change the timeout to two seconds in wgpu, most of the time no timeouts occurs, but occasionally there is a single timeout immediately after the screen is locked.

On the other hand, while import is running, every frame times out.

Oct 01 '22 16:10 mdickopp

Interesting 🤔 Can we bump the timeout to like 2 seconds globally and eliminate the first problem?

Oct 01 '22 18:10 alice-i-cecile

Screen locker panic is surely X11 "bug". Protocol XML spec doesn't represents session locking concept like recent ext-session-lock-v1.xml extension for the Wayland protocol. Basically when screen is locked, display server should lock the session which means, none of the conmected clients should have access to the GPU, simply stop rendering and sending any events to the server until session is unlocked.

I can see a lot of issues caused by X11, thats because of general protocol design which was modern 30 years ago during *BSD era. X11 simply doesn't play well with modern graphics and e.g Vulkan is one big UB with many workarounds.

Oct 01 '22 19:10 heavyrain266

@alice-i-cecile

Interesting :thinking: Can we bump the timeout to like 2 seconds globally and eliminate the first problem?

Occasionally (roughly estimated, one time out of five to ten times I lock the screen) there is a single timeout event even with a 2 seconds timeout. But there is never more than one while the screen is locked. I do not understand why this happens.

So bumping the timeout would make the issue occur occasionally instead of every time, but not eliminate it.

Oct 02 '22 11:10 mdickopp

Trying to test the version, it fails to build with error: failed to select a version for unicode-xid which apparently is somewhere in the depths of the deptree for bevy itself. can someone advise on how to properly test the version a14a34453a81e327a9a31da9180718f226dce714 ? I'd love to test the fix and report. PS: sorry for silly questions, my cargo-fu is fairly weak . I understand what I roughly need to do, but not how to achieve that exactly.

Oct 14 '22 12:10 alexpyattaev

@alexpyattaev Have you tried cargo update then try to compile again? It has been known to help.

Oct 14 '22 12:10 nicopap

Thanks @nicopap ! I was 100% sure that cargo clean actually wipes all build artifacts - I was so wrong! Live and learn, I suppose. Anyhow, testing was done usign breakout example and nightly compiler. This is what it outputs into terminal:

cargo +nightly run --features wayland --example breakout
   Compiling bevy v0.9.0-dev (/home/headhunter/git/nico)
    Finished dev [unoptimized + debuginfo] target(s) in 17.84s
     Running `target/debug/examples/breakout`
2022-10-15T07:07:44.720723Z  INFO winit::platform_impl::platform::x11::window: Guessed window scale factor: 1.6666666666666667    
2022-10-15T07:07:44.796340Z  INFO bevy_render::renderer: AdapterInfo { name: "AMD Radeon Vega 8 Graphics", vendor: 4098, device: 5592, device_type: IntegratedGpu, backend: Vulkan }
2022-10-15T07:07:44.797246Z ERROR wgpu_hal::vulkan::instance: VALIDATION [VUID-VkDeviceCreateInfo-pNext-02830 (0x211e533b)]
        Validation Error: [ VUID-VkDeviceCreateInfo-pNext-02830 ] Object 0: handle = 0x55e1467b1860, type = VK_OBJECT_TYPE_INSTANCE; | MessageID = 0x211e533b | If the pNext chain includes a VkPhysicalDeviceVulkan12Features structure, then it must not include a VkPhysicalDevice8BitStorageFeatures, VkPhysicalDeviceShaderAtomicInt64Features, VkPhysicalDeviceShaderFloat16Int8Features, VkPhysicalDeviceDescriptorIndexingFeatures, VkPhysicalDeviceScalarBlockLayoutFeatures, VkPhysicalDeviceImagelessFramebufferFeatures, VkPhysicalDeviceUniformBufferStandardLayoutFeatures, VkPhysicalDeviceShaderSubgroupExtendedTypesFeatures, VkPhysicalDeviceSeparateDepthStencilLayoutsFeatures, VkPhysicalDeviceHostQueryResetFeatures, VkPhysicalDeviceTimelineSemaphoreFeatures, VkPhysicalDeviceBufferDeviceAddressFeatures, or VkPhysicalDeviceVulkanMemoryModelFeatures structure The Vulkan spec states: If the pNext chain includes a VkPhysicalDeviceVulkan12Features structure, then it must not include a VkPhysicalDevice8BitStorageFeatures, VkPhysicalDeviceShaderAtomicInt64Features, VkPhysicalDeviceShaderFloat16Int8Features, VkPhysicalDeviceDescriptorIndexingFeatures, VkPhysicalDeviceScalarBlockLayoutFeatures, VkPhysicalDeviceImagelessFramebufferFeatures, VkPhysicalDeviceUniformBufferStandardLayoutFeatures, VkPhysicalDeviceShaderSubgroupExtendedTypesFeatures, VkPhysicalDeviceSeparateDepthStencilLayoutsFeatures, VkPhysicalDeviceHostQueryResetFeatures, VkPhysicalDeviceTimelineSemaphoreFeatures, VkPhysicalDeviceBufferDeviceAddressFeatures, or VkPhysicalDeviceVulkanMemoryModelFeatures structure (https://www.khronos.org/registry/vulkan/specs/1.3-extensions/html/vkspec.html#VUID-VkDeviceCreateInfo-pNext-02830)    
2022-10-15T07:07:44.797307Z ERROR wgpu_hal::vulkan::instance:   objects: (type: INSTANCE, hndl: 0x55e1467b1860, name: ?)    

^C
cargo +nightly run --features wayland --example breakout  44.15s user 4.85s system 35% cpu 2:19.06 total

As you can see it is not entirely issue-free (scary errors in log, and the engine seems to be unable to exit cleanly without me doing ctrl-C for it). However, overall the engine works just fine (audio, video, input). The original issue you have set out to fix is gone.

Oct 15 '22 07:10 alexpyattaev

I can reproduce this on an NVIDIA GPU (Linux)

Oct 24 '22 01:10 meisme-dev

@meisme-dev Can you give more precise system specs, notably kernel version, distro, card model etc? The fact that it doesn't work with both Fifo and immediate mode tells me it might be an unrelated issue.

See the issue template for how to get the specs https://github.com/bevyengine/bevy/blob/main/.github/ISSUE_TEMPLATE/bug_report.md

Oct 24 '22 09:10 nicopap

@meisme-dev Can you give more precise system specs, notably kernel version, distro, card model etc? The fact that it doesn't work with both Fifo and immediate mode tells me it might be an unrelated issue.

See the issue template for how to get the specs https://github.com/bevyengine/bevy/blob/main/.github/ISSUE_TEMPLATE/bug_report.md

Kernel: 5.15.74 Distro: NixOS Unstable (Raccoon) Card model: RTX 2070 Super Driver version: 520.56.06 wgpu-info: { name: "NVIDIA GeForce RTX 2070 SUPER", vendor: 4318, device: 7812, device_type: DiscreteGpu, driver: "NVIDIA", driver_info: "520.56.06", backend: Vulkan }

Oct 24 '22 14:10 meisme-dev

On the one hand: I think its worth hacking around this quirk if we can, in the interest of getting Bevy running on more computers. Panicking at startup (or intermittently) is a high priority bug fix. We have multiple people saying that this works, and Veloren successfully using it is a reasonable indicator that it works. On the other hand, it feels important to understand what is happening here. Theres a chance that doing the wrong thing here will introduce ghosts in the system, hard-to-debug issues, unnecessary screen "flashing" as timeouts occur, etc.

Oct 30 '22 19:10 cart

I'm going to re-add this to the 0.9 milestone, just so we can make a final call on this if the conversation progresses.

Oct 30 '22 19:10 cart

bevy bevy copied to clipboard

Ignore `Timeout` errors on Linux AMD

Objective

Solution

Alternative

bevy
bevy copied to clipboard