Vulkan hangs after a certain render sequence, and either panics, hangs or loses device

Open Dinnerbone opened this issue 5 years ago • 12 comments

Description

It's a little hard to turn this into a small repro case so please forgive the vagueness.

After submitting a frame to wgpu while using the Vulkan backend, Vulkan seems to become unstable, and this manifests itself in a few ways:

  • A hang of the application
  • Graphics device crashes and PC dies (this happened to me a few times)
  • The submit seemingly returns okay but nothing actually happened, and the device will become lost the next time we try to draw a frame

Our application has two ways to reproduce the bug:

  • When rendering to a window, we repeatedly submit frames as a typical game would. This often just locks up but sometimes will gracefully give you an error about the device being lost.
  • Rendering one single frame to a texture, saved to disk.

In this second case, we perform the following sequence of events:

  • Create a texture
  • Draw a frame to a command encoder
  • Submit the command encoder to the queue
  • Copy the texture buffer to disk, much like the capture example in wgpu-rs
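For readers unfamiliar with the readback step, here is a rough, dependency-free sketch of the row-padding arithmetic it involves, plus a check for the "completely empty" symptom. It assumes 4 bytes per pixel, a 256-byte row alignment (the value of `wgpu::COPY_BYTES_PER_ROW_ALIGNMENT`, hard-coded here), and that "empty" means every byte is zero. `strip_row_padding` and `is_blank` are hypothetical helpers for illustration, not part of wgpu or Ruffle.

```rust
/// Row alignment required for texture-to-buffer copies. This mirrors
/// `wgpu::COPY_BYTES_PER_ROW_ALIGNMENT`; it is hard-coded so the sketch
/// has no dependencies.
const COPY_BYTES_PER_ROW_ALIGNMENT: usize = 256;

/// Bytes per row once the row is padded up to the required alignment.
fn padded_bytes_per_row(width: usize, bytes_per_pixel: usize) -> usize {
    let unpadded = width * bytes_per_pixel;
    let padding = (COPY_BYTES_PER_ROW_ALIGNMENT - unpadded % COPY_BYTES_PER_ROW_ALIGNMENT)
        % COPY_BYTES_PER_ROW_ALIGNMENT;
    unpadded + padding
}

/// Hypothetical helper: drops the per-row padding from a readback buffer
/// and returns tightly packed pixel data. Assumes `padded` holds at least
/// `height` full rows of `padded_bytes_per_row` bytes each.
fn strip_row_padding(
    padded: &[u8],
    width: usize,
    height: usize,
    bytes_per_pixel: usize,
) -> Vec<u8> {
    let unpadded = width * bytes_per_pixel;
    let stride = padded_bytes_per_row(width, bytes_per_pixel);
    padded
        .chunks(stride)
        .take(height)
        .flat_map(|row| row[..unpadded].iter().copied())
        .collect()
}

/// Hypothetical helper: true if every byte is zero, which is one way to
/// detect the "empty texture" symptom before writing the image to disk.
fn is_blank(pixels: &[u8]) -> bool {
    pixels.iter().all(|&b| b == 0)
}
```

With the 550x400 resolution from this SWF, each row is 2200 bytes unpadded and 2304 bytes padded. Running `is_blank` on the stripped data right after the buffer is mapped would show whether the frame was lost before the file is written.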

The submit seemingly returns okay, but the texture is completely empty when we'd expect to see some graphics in it. The application then freezes (at least for me on Windows; this seems to vary) when dropping wgpu::Instance. For reference, the image it spits out should be identical to this one.

I've taken a trace of this single-frame capture and had to manually close the trace file, as the recording can't finish. This seems to freeze when played back, but I'm unable to get RenderDoc to play nice and see anything from it.

This worked for us in the past; I think I was running this without issues as recently as 24 days ago. The same code, unchanged, no longer works today.

Repro steps

I haven't been able to create a minimal reproducible example, but you can see it in our project with the following steps:

  • Grab this swf
  • Clone Ruffle
  • cargo run --package=ruffle_desktop -- test.swf if you want to see it visually, with multiple frames
  • cargo run --package=exporter -- test.swf if you want to see the single frame saved to a texture on disk

You can apply this commit to reduce the amount of rendering done to the bare minimum that still crashes, with that particular swf: https://github.com/Dinnerbone/ruffle/commit/b4f173dbbc9db0128cd2c591d71efbd49e024852

Expected vs observed behavior

I expect to either get an error describing how we're using wgpu wrong, or for it to work :D

Extra materials

Platform

Reproduced on Windows. Only affects the Vulkan backend. We're seeing some instability with DX12, but we're not certain it's related yet. Reproduced on wgpu 0.6 and https://github.com/gfx-rs/wgpu-rs/commit/e3eadca8c626beb9a1c25c359b0e20f6fdef00c4

Dinnerbone avatar Oct 13 '20 22:10 Dinnerbone

I took some time to bisect the driver version where the hang occurs on my machine (GeForce RTX 2080 Ti, Windows 10 64-bit):

  • GeForce Game Ready Driver 452.06 (Aug 17): works
  • GeForce Game Ready Driver 456.38 (Sep 17) and later: hangs
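If anyone wants to flag affected setups automatically, the bisect result can be turned into a simple version check. This is only a sketch: it assumes driver versions are written as "major.minor" with a two-digit minor part (as in 452.06), and both function names are made up for illustration. It reflects what was observed on this one machine and is not a confirmed driver bug range.

```rust
/// First GeForce driver version reported in this thread to hang (456.38).
const FIRST_HANGING_DRIVER: (u32, u32) = (456, 38);

/// Parses a driver version such as "452.06" into (major, minor).
/// A missing minor part (e.g. "443") is treated as 0.
/// Returns None if either part is not a number.
fn parse_driver_version(s: &str) -> Option<(u32, u32)> {
    let mut parts = s.trim().splitn(2, '.');
    let major: u32 = parts.next()?.parse().ok()?;
    let minor: u32 = match parts.next() {
        Some(m) => m.parse().ok()?,
        None => 0,
    };
    Some((major, minor))
}

/// Hypothetical check: Some(true) if the version is at or after the first
/// driver observed to hang, Some(false) if it is older, None if unparsable.
fn in_reported_hang_range(version: &str) -> Option<bool> {
    parse_driver_version(version).map(|v| v >= FIRST_HANGING_DRIVER)
}
```

Tuples compare lexicographically, so the major version is compared first and the minor version only breaks ties.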

Herschel avatar Oct 14 '20 01:10 Herschel

It looks like there are validation errors when it panics running desktop (swap chain + multiple frames). The panic happens about 3 minutes after the first frame submission.

[2020-10-14T17:09:27Z INFO  ruffle_core::player] Loaded SWF version 15, with a resolution of 550x400
[2020-10-14T17:12:33Z ERROR gfx_backend_vulkan] 
    VALIDATION [VUID-vkDestroyImageView-imageView-01026 (1672225264)] : Validation Error: [ VUID-vkDestroyImageView-imageView-01026 ] Object 0: handle = 0x217aa76d4b8, type = VK_OBJECT_TYPE_DEVICE; | MessageID = 0x63ac21f0 | Cannot call vkDestroyImageView on VkImageView 0x731f0f000000000a[] that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to imageView must have completed execution (https://vulkan.lunarg.com/doc/view/1.2.148.0/windows/1.2-extensions/vkspec.html#VUID-vkDestroyImageView-imageView-01026)
    object info: (type: DEVICE, hndl: 2300667417784)
    
[2020-10-14T17:12:33Z ERROR gfx_backend_vulkan] 
    VALIDATION [VUID-vkDestroyFramebuffer-framebuffer-00892 (-617577710)] : Validation Error: [ VUID-vkDestroyFramebuffer-framebuffer-00892 ] Object 0: handle = 0x217aa76d4b8, type = VK_OBJECT_TYPE_DEVICE; | MessageID = 0xdb308312 | Cannot call vkDestroyFramebuffer on VkFramebuffer 0xfba8190000000804[] that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to framebuffer must have completed execution (https://vulkan.lunarg.com/doc/view/1.2.148.0/windows/1.2-extensions/vkspec.html#VUID-vkDestroyFramebuffer-framebuffer-00892)
    object info: (type: DEVICE, hndl: 2300667417784)
    
[2020-10-14T17:12:33Z ERROR gfx_backend_vulkan] 
    VALIDATION [VUID-vkDestroyFramebuffer-framebuffer-00892 (-617577710)] : Validation Error: [ VUID-vkDestroyFramebuffer-framebuffer-00892 ] Object 0: handle = 0x217aa76d4b8, type = VK_OBJECT_TYPE_DEVICE; | MessageID = 0xdb308312 | Cannot call vkDestroyFramebuffer on VkFramebuffer 0xeaf23a0000000807[] that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to framebuffer must have completed execution (https://vulkan.lunarg.com/doc/view/1.2.148.0/windows/1.2-extensions/vkspec.html#VUID-vkDestroyFramebuffer-framebuffer-00892)
    object info: (type: DEVICE, hndl: 2300667417784)
    
[2020-10-14T17:12:33Z ERROR gfx_backend_vulkan] 
    VALIDATION [VUID-vkDestroySemaphore-semaphore-01137 (-1588160456)] : Validation Error: [ VUID-vkDestroySemaphore-semaphore-01137 ] Object 0: handle = 0x217aa76d4b8, type = VK_OBJECT_TYPE_DEVICE; | MessageID = 0xa1569838 | Cannot call vkDestroySemaphore on VkSemaphore 0xe81828000000000d[] that is currently in use by a command buffer. The Vulkan spec states: All submitted batches that refer to semaphore must have completed execution (https://vulkan.lunarg.com/doc/view/1.2.148.0/windows/1.2-extensions/vkspec.html#VUID-vkDestroySemaphore-semaphore-01137)
    object info: (type: DEVICE, hndl: 2300667417784)
    
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `Ok(())`,
 right: `Err(ERROR_DEVICE_LOST)`', C:\Users\dinne\.cargo\registry\src\github.com-1ecc6299db9ec823\gfx-backend-vulkan-0.6.1\src\lib.rs:1516:9
stack backtrace:
   0: std::panicking::begin_panic_handler
             at /rustc/18bf6b4f01a6feaf7259ba7cdae58031af1b7b39\/library\std\src\panicking.rs:475
   1: std::panicking::begin_panic_fmt
             at /rustc/18bf6b4f01a6feaf7259ba7cdae58031af1b7b39\/library\std\src\panicking.rs:429
   2: gfx_backend_vulkan::{{impl}}::submit<gfx_backend_vulkan::command::CommandBuffer,core::iter::adapters::chain::Chain<core::option::IntoIter<gfx_backend_vulkan::command::CommandBuffer*>, core::iter::adapters::flatten::FlatMap<core::slice::Iter<wgpu_core::id:
             at C:\Users\dinne\.cargo\registry\src\github.com-1ecc6299db9ec823\gfx-backend-vulkan-0.6.1\src\lib.rs:1516
   3: wgpu_core::hub::Global<wgpu_core::hub::IdentityManagerFactory>::queue_submit<wgpu_core::hub::IdentityManagerFactory,gfx_backend_vulkan::Backend>
             at C:\Users\dinne\.cargo\git\checkouts\wgpu-3d51dad24d4bec0d\3b76651\wgpu-core\src\device\queue.rs:629
   4: wgpu::backend::direct::{{impl}}::queue_submit<core::iter::adapters::Map<alloc::vec::IntoIter<wgpu::CommandBuffer>, closure-0>>
             at C:\Users\dinne\.cargo\git\checkouts\wgpu-rs-40ea39809c03c5d8\e3eadca\src\backend\direct.rs:1507
   5: wgpu::Queue::submit<alloc::vec::Vec<wgpu::CommandBuffer>>
             at C:\Users\dinne\.cargo\git\checkouts\wgpu-rs-40ea39809c03c5d8\e3eadca\src\lib.rs:2601
   6: ruffle_render_wgpu::target::{{impl}}::submit<alloc::vec::Vec<wgpu::CommandBuffer>>
             at .\render\wgpu\src\target.rs:99
   7: ruffle_render_wgpu::{{impl}}::end_frame<ruffle_render_wgpu::target::SwapChainTarget>
             at .\render\wgpu\src\lib.rs:1304
   8: ruffle_core::player::Player::render
             at .\core\src\player.rs:832
   9: ruffle_desktop::run_player::{{closure}}
             at .\desktop\src\main.rs:180

Edit: Actually, that was when testing with #987, with commit https://github.com/kvark/wgpu/commit/3b7665140f841b687ad7969387272d3e99f08c48. One commit before, https://github.com/kvark/wgpu/commit/3be2c452c4fda69cab13f0e2c79d2338f640a3f0, there are no validation errors:

[2020-10-14T17:41:58Z INFO  ruffle_core::player] Loaded SWF version 15, with a resolution of 550x400
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `Ok(())`,
 right: `Err(ERROR_DEVICE_LOST)`', C:\Users\dinne\.cargo\registry\src\github.com-1ecc6299db9ec823\gfx-backend-vulkan-0.6.1\src\lib.rs:1516:9
stack backtrace:
   0: std::panicking::begin_panic_handler
             at /rustc/18bf6b4f01a6feaf7259ba7cdae58031af1b7b39\/library\std\src\panicking.rs:475
   1: std::panicking::begin_panic_fmt
             at /rustc/18bf6b4f01a6feaf7259ba7cdae58031af1b7b39\/library\std\src\panicking.rs:429
   2: gfx_backend_vulkan::{{impl}}::submit<gfx_backend_vulkan::command::CommandBuffer,core::iter::adapters::chain::Chain<core::option::IntoIter<gfx_backend_vulkan::command::CommandBuffer*>, core::iter::adapters::flatten::FlatMap<core::slice::Iter<wgpu_core::id:
             at C:\Users\dinne\.cargo\registry\src\github.com-1ecc6299db9ec823\gfx-backend-vulkan-0.6.1\src\lib.rs:1516
   3: wgpu_core::hub::Global<wgpu_core::hub::IdentityManagerFactory>::queue_submit<wgpu_core::hub::IdentityManagerFactory,gfx_backend_vulkan::Backend>
             at C:\Users\dinne\.cargo\git\checkouts\wgpu-3d51dad24d4bec0d\3be2c45\wgpu-core\src\device\queue.rs:629
   4: wgpu::backend::direct::{{impl}}::queue_submit<core::iter::adapters::Map<alloc::vec::IntoIter<wgpu::CommandBuffer>, closure-0>>
             at C:\Users\dinne\.cargo\git\checkouts\wgpu-rs-40ea39809c03c5d8\e3eadca\src\backend\direct.rs:1507
   5: wgpu::Queue::submit<alloc::vec::Vec<wgpu::CommandBuffer>>
             at C:\Users\dinne\.cargo\git\checkouts\wgpu-rs-40ea39809c03c5d8\e3eadca\src\lib.rs:2601
   6: ruffle_render_wgpu::target::{{impl}}::submit<alloc::vec::Vec<wgpu::CommandBuffer>>
             at .\render\wgpu\src\target.rs:99
   7: ruffle_render_wgpu::{{impl}}::end_frame<ruffle_render_wgpu::target::SwapChainTarget>
             at .\render\wgpu\src\lib.rs:1304
   8: ruffle_core::player::Player::render
             at .\core\src\player.rs:832
   9: ruffle_desktop::run_player::{{closure}}
             at .\desktop\src\main.rs:180

Dinnerbone avatar Oct 14 '20 17:10 Dinnerbone

Oh interesting, thank you! I'll try leaving it running for 5 minutes, I guess :)

kvark avatar Oct 14 '20 17:10 kvark

Is this still an issue, now that we landed the gfx-memory fix?

kvark avatar Oct 15 '20 15:10 kvark

We didn't use 0.2.1 for this, we were locked on 0.2.0.

I just upgraded to 0.2.2 just in case but the issue still persists.

I think that was a separate issue that I initially confused with this because cargo install doesn't respect lockfiles by default 😅

Dinnerbone avatar Oct 15 '20 15:10 Dinnerbone

What exactly are the repro steps now? Run cargo run --package=ruffle_desktop -- test.swf and wait N minutes?

kvark avatar Oct 15 '20 15:10 kvark

Two repro steps, but it looks like you need to be on Windows with a GeForce driver >= 456.38:

  • cargo run --package=ruffle_desktop -- test.swf: this will make it freeze immediately, and panic after 3 minutes.
  • cargo run --package=exporter -- test.swf: this will spit out an image, test.png, which is incorrectly empty, and then hang the program as it tries to drop wgpu::Instance.

Dinnerbone avatar Oct 15 '20 15:10 Dinnerbone

@Dinnerbone finally got to test this on Windows/NV GTX 1050 Ti/Vulkan. It runs fine... although my driver version is 443, which is the latest that Lenovo considers valid for this ThinkPad X1 Extreme. Force-installing anything fresher may invite more trouble than it's worth. Looks like you found a genuine NVidia Vulkan bug. Looking forward to seeing if they respond!

kvark avatar Oct 15 '20 22:10 kvark

The hang is unfortunately still occurring in the latest two Nvidia driver versions, 457.09 and 457.30 (November 2020), and with https://github.com/gfx-rs/wgpu-rs/commit/2563f2083720b3553c9e41ae3216b3acf06bfcff

Herschel avatar Nov 11 '20 22:11 Herschel

Here are minimal repro traces that hang on my machine in player.

wgpu-vulkan-hang.zip

trace_good draws a single red triangle and replays successfully. trace_bad tries to draw a second red triangle. This displays a blank screen and hangs during recording & playback. (trace_bad was manually edited to add the closing ] to the RON file).
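The manual edit can also be scripted. This is a minimal sketch, assuming the trace is a RON list whose only missing piece is the final `]` because the recording never finished. `close_trace_list` is a hypothetical helper, not part of wgpu or its player, and it would be fooled by a truncated trace whose last written line happens to end in `]`.

```rust
/// Hypothetical helper: returns the trace text with a closing `]` appended
/// if the text does not already end with one. Trailing whitespace is
/// normalized to a single newline either way.
fn close_trace_list(contents: &str) -> String {
    let trimmed = contents.trim_end();
    if trimmed.ends_with(']') {
        // Already closed: only normalize the trailing newline.
        format!("{}\n", trimmed)
    } else {
        // Truncated recording: add the missing bracket on its own line.
        format!("{}\n]\n", trimmed)
    }
}
```

To use it, read the trace file into a string with `std::fs::read_to_string`, pass it through the function, and write the result back with `std::fs::write`.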

This only occurs with the vulkan backend. The traces using the dx12 backend replay correctly.

The diff between the two traces boils down to the additional buffer creation and render pass.

Removing the buffer copy at line 442 of trace-bad's trace.ron causes the replay to run successfully (but it only displays one triangle):

    CopyBufferToBuffer(
        src: Id(7, 1, Vulkan),
        src_offset: 0,
        dst: Id(1, 1, Vulkan),
        dst_offset: 0,
        size: 64,
    ),

How to create the trace:

  • git clone -b wgpu-hang-repro https://github.com/Herschel/ruffle
  • cd ruffle/desktop
  • cargo run --features=render_trace -- -g vulkan --trace-path=trace-bad test.swf

Running the trace using wgpu/player: cargo run --features=winit -- trace-bad

  • Windows 10 64-bit
  • Nvidia GeForce 2080 Ti, Game Ready Driver 457.30
  • Vulkan SDK 1.2.154.1
  • https://github.com/gfx-rs/wgpu-rs/commit/2563f2083720b3553c9e41ae3216b3acf06bfcff

Herschel avatar Nov 12 '20 00:11 Herschel

Do you guys have an API trace for a fresh version of wgpu by any chance to reproduce it?

kvark avatar Apr 13 '21 03:04 kvark

Interestingly, the "trace-bad" is replayed without issues here on "GTX 1050 Ti Max-Q" driver version 27.21.14.5256. @Herschel could you check if you are still seeing it?

kvark avatar Apr 13 '21 03:04 kvark