bevy
Ignore `Timeout` errors on Linux AMD
Objective
- Fix #3606
- Fix #4579
- Fix #3380
Solution
When running on a Linux machine with an AMD device, ignore `wgpu::SurfaceError::Timeout` errors returned by `surface.get_current_texture()`.
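A minimal sketch of the idea, not the code in this PR: the `SurfaceError` enum below is a local stand-in that only mirrors the variant names of `wgpu::SurfaceError` so the example compiles without wgpu, and both `acquire_or_skip` and the `tolerate_timeout` flag are hypothetical names. The flag stands in for the Linux and AMD check described above.

```rust
/// Local stand-in for `wgpu::SurfaceError` (assumption: only the variant
/// names are mirrored, so the sketch compiles without the wgpu crate).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SurfaceError {
    Timeout,
    Outdated,
    Lost,
    OutOfMemory,
}

/// Decide what to do with the result of acquiring the next surface texture.
///
/// Returns `Some(frame)` when a frame was acquired and `None` when the frame
/// should be skipped. `tolerate_timeout` is a hypothetical flag standing in
/// for the platform check (Linux with an AMD device).
pub fn acquire_or_skip<T>(
    result: Result<T, SurfaceError>,
    tolerate_timeout: bool,
) -> Option<T> {
    match result {
        Ok(frame) => Some(frame),
        // Ignore the timeout: skip rendering this frame and try again later.
        Err(SurfaceError::Timeout) if tolerate_timeout => None,
        // Any other error (or a timeout on other platforms) is still fatal.
        Err(err) => panic!("failed to acquire next surface texture: {:?}", err),
    }
}
```

With `tolerate_timeout` set to `false`, a timeout still panics, which matches the intent of treating it as a bug everywhere else.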
Alternative
An alternative solution found in the wgpu examples is:
let frame = surface
    .get_current_texture()
    .or_else(|_| {
        render_device.configure_surface(surface, &swap_chain_descriptor);
        surface.get_current_texture()
    })
    .expect("Error reconfiguring surface");
window.swap_chain_texture = Some(TextureView::from(frame));
The reason I went with this PR's solution is that `configure_surface` seems to be quite an expensive operation, and with the wgpu framework solution it would run every frame, even though everything works perfectly fine without `configure_surface`.
I know this looks super hacky with the Linux-specific line and the AMD check, but my understanding is that the `Timeout` occurrence is specific to a quirk of some AMD drivers on Linux, and if otherwise met should be considered a bug.
@mdickopp @tirithen @alexpyattaev @bobhenkel @chiboreache @paullouisageneau @popojan @etam You seem to have hit this bug, could a benevolent soul try out this patch?
Bevy seems to no longer crash on Wayland for me, finally can disable XWayland in orichalcum.
Tried the method from 4579, but didn't get any errors. Is there a reliable minimal way to reproduce? Running X11 on AMD XT6700 with 0.9.0-dev.
I am sorry, in the meantime I switched to Wayland/sway and I cannot easily reproduce the problem anymore.
When I tried to use nicopap's revision as a dependency and recompile, I got an error stating `cannot provide explicit generic arguments when impl Trait is used in argument position`, related to the `::<SystemStage>` usage in `mod.rs`, but it may well be my fault, I am new to Rust.
This fixes both #3380 and #4579 for me. Thanks!
(Since I do not use Wayland, I cannot test #3606).
Please note that I can reproduce both #3380 and #4579 on an Intel device, so I do not think they are specific to AMD devices.
2022-09-13T17:14:48.220831Z INFO winit::platform_impl::platform::x11::window: Guessed window scale factor: 1.75
2022-09-13T17:14:48.260219Z INFO bevy_render::renderer: AdapterInfo { name: "Intel(R) HD Graphics 5500 (BDW GT2)", vendor: 32902, device: 5654, device_type: IntegratedGpu, backend: Vulkan }
Draft until the workaround is expanded to Intel devices. I also happen to have an Intel GPU handy, so I might be able to test as well.
@mdickopp
Please note that I can reproduce both https://github.com/bevyengine/bevy/issues/3380 and https://github.com/bevyengine/bevy/issues/4579 on an Intel device
Hmm, can't reproduce on my Whiskey Lake intel iGPU. Looks like you are using a Broadwell, which is very common. I've a Broadwell CPU somewhere, but, at the moment I can't test on it, as the motherboard is pretty much in a cardboard box without peripherals.
I wonder if it wouldn't be better to ignore timeouts for all cards and drivers, but only for a specific time. That is, ignore intermittent ones:
That way we can still catch degraded application/driver state because when the driver is returning timeouts for a whole, say, second or two something very much looks amiss, at the same time delaying panic on non-problematic configurations seems benign. We can even make the timeout configurable in case some valiant gamer tries to run things on a potato or something.
Bonus: A message like "Graphics driver returned timeouts for X seconds" points end-users squarely at the issue.
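One way the time-bounded idea could be sketched, assuming a small tracker that the render loop updates every frame. `TimeoutGrace` and its methods are hypothetical names, not part of Bevy or wgpu; the current time is passed in explicitly so the logic can be exercised without waiting.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: tolerates surface timeouts only while the last
/// successful frame is recent enough.
pub struct TimeoutGrace {
    last_success: Instant,
    grace: Duration,
}

impl TimeoutGrace {
    /// `grace` is the configurable period during which timeouts are ignored.
    pub fn new(now: Instant, grace: Duration) -> Self {
        Self { last_success: now, grace }
    }

    /// Call whenever a frame was acquired successfully.
    pub fn record_success(&mut self, now: Instant) {
        self.last_success = now;
    }

    /// Call whenever acquiring a frame timed out.
    ///
    /// Returns `Ok(())` while the timeout can be ignored, and
    /// `Err(elapsed)` once timeouts have persisted longer than `grace`.
    pub fn record_timeout(&self, now: Instant) -> Result<(), Duration> {
        let elapsed = now.saturating_duration_since(self.last_success);
        if elapsed <= self.grace {
            Ok(())
        } else {
            Err(elapsed)
        }
    }
}
```

A caller could turn the `Err(elapsed)` case into a panic or log line along the lines of "Graphics driver returned timeouts for X seconds", using `elapsed.as_secs_f32()` for the value.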
I wonder if it wouldn't be better to ignore timeouts for all cards and drivers, but only for a specific time. That is, ignore intermittent ones:
I'm interested in this solution; the less hardcoded special-casing the better.
@ksf For me personally, the timeout happens every frame, despite the frame clearly drawing in less than the actual timeout, so your proposed solution wouldn't work (frame draws in well below 16ms, timeout is a full second)
- https://github.com/gfx-rs/wgpu/issues/1218 Says it happens "every few seconds".
- https://github.com/gfx-rs/wgpu/issues/2941 logs seem to imply it happens every frame.
It might be possible to not special-case it, which is, from what I understand, how Veloren does it. I guess I was worried that I would break other assumptions. As far as I know, it shouldn't break anything, but "if it works on Linux it works on Windows" is not a sentence I've heard many times… So I stayed conservative and limited the workaround to exactly what I changed. I'd be happy to remove the `#[cfg(target_os = "linux")]` if someone else can confirm it works. But I'm worried another contributor will chime in soon and say "if this bug is limited to Linux, why not add `#[cfg(target_os = "linux")]`?" And then I'll have to write another lengthy reply justifying my decisions.
Anyway, maybe the fact we log on `debug!` is misleading and is what led you to believe it was intermittent? Should we entirely skip logging or log on `trace!` level?
Remember please that this workaround fixes a bug that prevents people from using bevy at all, so getting it in at all should be a priority, getting fancy with it can wait IMO (maybe open an issue once this is merged?)
@nicopap In my case the timeout happens during configuring events, things like resizing; that's why I assumed it was an intermittent issue. But in hindsight, if the user is drag-resizing the window for a second and every request times out, so that the "last successful" time can't get reset, my solution would still panic. I'm not sure whether all requests time out in my case; I would have to investigate (at the current, early state of development of my code I said "meh" and hard-coded the present mode to `AutoNoVSync`).
I definitely agree that your solution is better than just crashing, and even when things get more well-behaved upstream we'll have to support older drivers (though at some point I'd say it becomes sensible to tell people to upgrade or disable VSync/eat the performance hit)
@nicopap
Hmm, can't reproduce on my Whiskey Lake intel iGPU. Looks like you are using a Broadwell, which is very common. I've a Broadwell CPU somewhere, but, at the moment I can't test on it, as the motherboard is pretty much in a cardboard box without peripherals.
I re-tested your latest commit (a14a344) on my Intel system, and can confirm that it fixes the bugs for me.
This looks controversial. I hear the reasoning. I guess ideally it would be fixed in mesa but even then it will take time for new drivers to roll out. Still, I’d like to understand it before ignoring the error.
100% Agree. I really dislike the fact I basically don't understand what I'm doing here.
After some digging I found out that the Vulkan backend of wgpu calls vkAcquireNextImageKHR with a timeout of 1 second.
A comment in (an older version of) the source code of Chromium hints at a bug in X11 and how they worked around it: https://chromium.googlesource.com/chromium/src/+/8ec9935d64c1fcc72d09c2d44ac1dfc0a29514f3/gpu/vulkan/x/vulkan_surface_x11.cc#62 For one thing, they use a 2 seconds timeout. I'll do some more testing.
Setting the timeout in wgpu to 2 seconds does not fix ~the issues~ issue #3380 for me. (EDIT: See more detailed explanation below.)
After some digging I found out that the Vulkan backend of wgpu calls vkAcquireNextImageKHR with a timeout of 1 second.
Quoth the spec:
VK_NOT_READY is returned if timeout is zero and no image was available.
VK_TIMEOUT is returned if timeout is greater than zero and less than UINT64_MAX, and no image became available within the time allowed.
Until now I thought that the driver had some internal timeout, but it's wgpu which sets the timeout, and `VK_TIMEOUT` is not an error but a successful return. The driver seems to interpret the timeout rather creatively, though, returning before the timeout duration is over (or it couldn't be happening every frame).
We might be getting the behavior of setting a zero timeout and getting `NOT_READY` even when setting a timeout, the driver opting to return `TIMEOUT` in that case to be at least half-way spec compliant, not wanting to trip up programs which don't handle `NOT_READY` in that situation.
If that's the case then retrying for as long as the timeout is supposed to last and panicking after that should™ work out.
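A rough sketch of that retry idea, under the assumption that an early `Timeout` can simply be retried until the intended timeout duration has passed. `AcquireError` and `acquire_with_budget` are hypothetical names, and the closure stands in for the real acquire call.

```rust
use std::time::{Duration, Instant};

/// Hypothetical error type for the sketch; `Timeout` stands in for the
/// timeout reported when acquiring the next image.
#[derive(Debug, PartialEq, Eq)]
pub enum AcquireError {
    Timeout,
    Other,
}

/// Retry `try_acquire` while it reports `Timeout`, until `budget` (the time
/// the timeout was supposed to last) has elapsed. The last result is
/// returned, so the caller can still panic on a timeout that persisted.
pub fn acquire_with_budget<T>(
    mut try_acquire: impl FnMut() -> Result<T, AcquireError>,
    budget: Duration,
) -> Result<T, AcquireError> {
    let deadline = Instant::now() + budget;
    loop {
        match try_acquire() {
            // The call returned early with a timeout: try again while
            // there is still budget left.
            Err(AcquireError::Timeout) if Instant::now() < deadline => continue,
            // Success, a non-timeout error, or a timeout past the deadline.
            other => return other,
        }
    }
}
```

This loop spins if the acquire call returns immediately, so a real version would probably want to yield or sleep briefly between attempts.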
I did some more testing. Turns out locking the screen (#4579) and using the `import` utility (#3380) cause different behavior.
While the screen is locked, the program runs with a framerate of one frame per second. Since the timeout is also one second, it occurs on some, but not all frames. If I change the timeout to two seconds in wgpu, most of the time no timeouts occur, but occasionally there is a single timeout immediately after the screen is locked.
On the other hand, while `import` is running, every frame times out.
Interesting 🤔 Can we bump the timeout to like 2 seconds globally and eliminate the first problem?
The screen locker panic is surely an X11 "bug". The protocol's XML spec doesn't represent a session locking concept like the recent `ext-session-lock-v1.xml` extension for the Wayland protocol. Basically, when the screen is locked, the display server should lock the session, which means none of the connected clients should have access to the GPU; they should simply stop rendering and sending any events to the server until the session is unlocked.
I can see a lot of issues caused by X11; that's because of the general protocol design, which was modern 30 years ago during the *BSD era. X11 simply doesn't play well with modern graphics, and e.g. Vulkan is one big UB with many workarounds.
@alice-i-cecile
Interesting :thinking: Can we bump the timeout to like 2 seconds globally and eliminate the first problem?
Occasionally (roughly estimated, one time out of five to ten times I lock the screen) there is a single timeout event even with a 2 seconds timeout. But there is never more than one while the screen is locked. I do not understand why this happens.
So bumping the timeout would make the issue occur occasionally instead of every time, but not eliminate it.
Trying to test the version, it fails to build with `error: failed to select a version for unicode-xid`, which apparently is somewhere in the depths of the deptree for bevy itself.
can someone advise on how to properly test the version a14a34453a81e327a9a31da9180718f226dce714 ? I'd love to test the fix and report.
PS: sorry for the silly questions, my cargo-fu is fairly weak. I roughly understand what I need to do, but not exactly how to achieve it.
@alexpyattaev Have you tried `cargo update`, then compiling again? It has been known to help.
Thanks @nicopap! I was 100% sure that `cargo clean` actually wipes all build artifacts - I was so wrong! Live and learn, I suppose. Anyhow, testing was done using the breakout example and the nightly compiler. This is what it outputs into the terminal:
cargo +nightly run --features wayland --example breakout
Compiling bevy v0.9.0-dev (/home/headhunter/git/nico)
Finished dev [unoptimized + debuginfo] target(s) in 17.84s
Running `target/debug/examples/breakout`
2022-10-15T07:07:44.720723Z INFO winit::platform_impl::platform::x11::window: Guessed window scale factor: 1.6666666666666667
2022-10-15T07:07:44.796340Z INFO bevy_render::renderer: AdapterInfo { name: "AMD Radeon Vega 8 Graphics", vendor: 4098, device: 5592, device_type: IntegratedGpu, backend: Vulkan }
2022-10-15T07:07:44.797246Z ERROR wgpu_hal::vulkan::instance: VALIDATION [VUID-VkDeviceCreateInfo-pNext-02830 (0x211e533b)]
Validation Error: [ VUID-VkDeviceCreateInfo-pNext-02830 ] Object 0: handle = 0x55e1467b1860, type = VK_OBJECT_TYPE_INSTANCE; | MessageID = 0x211e533b | If the pNext chain includes a VkPhysicalDeviceVulkan12Features structure, then it must not include a VkPhysicalDevice8BitStorageFeatures, VkPhysicalDeviceShaderAtomicInt64Features, VkPhysicalDeviceShaderFloat16Int8Features, VkPhysicalDeviceDescriptorIndexingFeatures, VkPhysicalDeviceScalarBlockLayoutFeatures, VkPhysicalDeviceImagelessFramebufferFeatures, VkPhysicalDeviceUniformBufferStandardLayoutFeatures, VkPhysicalDeviceShaderSubgroupExtendedTypesFeatures, VkPhysicalDeviceSeparateDepthStencilLayoutsFeatures, VkPhysicalDeviceHostQueryResetFeatures, VkPhysicalDeviceTimelineSemaphoreFeatures, VkPhysicalDeviceBufferDeviceAddressFeatures, or VkPhysicalDeviceVulkanMemoryModelFeatures structure The Vulkan spec states: If the pNext chain includes a VkPhysicalDeviceVulkan12Features structure, then it must not include a VkPhysicalDevice8BitStorageFeatures, VkPhysicalDeviceShaderAtomicInt64Features, VkPhysicalDeviceShaderFloat16Int8Features, VkPhysicalDeviceDescriptorIndexingFeatures, VkPhysicalDeviceScalarBlockLayoutFeatures, VkPhysicalDeviceImagelessFramebufferFeatures, VkPhysicalDeviceUniformBufferStandardLayoutFeatures, VkPhysicalDeviceShaderSubgroupExtendedTypesFeatures, VkPhysicalDeviceSeparateDepthStencilLayoutsFeatures, VkPhysicalDeviceHostQueryResetFeatures, VkPhysicalDeviceTimelineSemaphoreFeatures, VkPhysicalDeviceBufferDeviceAddressFeatures, or VkPhysicalDeviceVulkanMemoryModelFeatures structure (https://www.khronos.org/registry/vulkan/specs/1.3-extensions/html/vkspec.html#VUID-VkDeviceCreateInfo-pNext-02830)
2022-10-15T07:07:44.797307Z ERROR wgpu_hal::vulkan::instance: objects: (type: INSTANCE, hndl: 0x55e1467b1860, name: ?)
^C
cargo +nightly run --features wayland --example breakout 44.15s user 4.85s system 35% cpu 2:19.06 total
As you can see it is not entirely issue-free (scary errors in log, and the engine seems to be unable to exit cleanly without me doing ctrl-C for it). However, overall the engine works just fine (audio, video, input). The original issue you have set out to fix is gone.
I can reproduce this on an NVIDIA GPU (Linux)
@meisme-dev Can you give more precise system specs, notably kernel version, distro, card model etc? The fact that it doesn't work with both Fifo and immediate mode tells me it might be an unrelated issue.
See the issue template for how to get the specs https://github.com/bevyengine/bevy/blob/main/.github/ISSUE_TEMPLATE/bug_report.md
@meisme-dev Can you give more precise system specs, notably kernel version, distro, card model etc? The fact that it doesn't work with both Fifo and immediate mode tells me it might be an unrelated issue.
See the issue template for how to get the specs https://github.com/bevyengine/bevy/blob/main/.github/ISSUE_TEMPLATE/bug_report.md
Kernel: 5.15.74 Distro: NixOS Unstable (Raccoon) Card model: RTX 2070 Super Driver version: 520.56.06 wgpu-info: { name: "NVIDIA GeForce RTX 2070 SUPER", vendor: 4318, device: 7812, device_type: DiscreteGpu, driver: "NVIDIA", driver_info: "520.56.06", backend: Vulkan }
On the one hand: I think it's worth hacking around this quirk if we can, in the interest of getting Bevy running on more computers. Panicking at startup (or intermittently) is a high priority bug fix. We have multiple people saying that this works, and Veloren successfully using it is a reasonable indicator that it works. On the other hand, it feels important to understand what is happening here. There's a chance that doing the wrong thing here will introduce ghosts in the system, hard-to-debug issues, unnecessary screen "flashing" as timeouts occur, etc.
I'm going to re-add this to the 0.9 milestone, just so we can make a final call on this if the conversation progresses.