
Queue::submit: Validation Error caused by Parent device is lost

apexys opened this issue · 4 comments

Describe the bug

I'm trying to create an implementation of the YOLO9000/YOLOv2 network in Burn with the WGPU backend. Creating the network works and I can also run it, but I can't use batch sizes larger than 8. When I do, the thread panics with the message:

    Finished dev [unoptimized + debuginfo] target(s) in 0.12s
     Running `target/debug/burn-wgpu-yolov2-mve`
Model parameter count: 21150400
Testing batch size 1
Batch output shape: Shape { dims: [1, 1280] }
Warmup epoch took 2186.857ms
Epoch 1 took 1015.232ms and reached loss 6400.792
Epoch 2 took 1033.960ms and reached loss 55.710
Epoch 3 took 1006.282ms and reached loss 128.202
Epoch 4 took 1004.970ms and reached loss 213.439
Average time per batch at batch_size=1: 1015.109ms
Testing batch size 4
Batch output shape: Shape { dims: [4, 1280] }
Warmup epoch took 16309.362ms
Epoch 1 took 3608.391ms and reached loss 74.053
Epoch 2 took 3622.630ms and reached loss 31.859
Epoch 3 took 3617.320ms and reached loss 7.854
Epoch 4 took 3625.252ms and reached loss 1.124
Average time per batch at batch_size=4: 3618.397ms
Testing batch size 8
Batch output shape: Shape { dims: [8, 1280] }
Warmup epoch took 22324.035ms
Epoch 1 took 5874.058ms and reached loss 0.000
thread 'main' panicked at /home/valentin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.18.0/src/backend/direct.rs:2327:30:
Error in Queue::submit: Validation Error

Caused by:
    Parent device is lost

stack backtrace:
   0:     0x565531b9944c - std::backtrace_rs::backtrace::libunwind::trace::ha69d38c49f1bf263
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:     0x565531b9944c - std::backtrace_rs::backtrace::trace_unsynchronized::h93125d0b85fd543c
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x565531b9944c - std::sys_common::backtrace::_print_fmt::h8d65f438e8343444
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/sys_common/backtrace.rs:67:5
   3:     0x565531b9944c - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h41751d2af6c8033a
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x565531bc0abc - core::fmt::rt::Argument::fmt::h5db2f552d8a28f63
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/fmt/rt.rs:138:9
   5:     0x565531bc0abc - core::fmt::write::h99465148a27e4883
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/fmt/mod.rs:1114:21
   6:     0x565531b96eee - std::io::Write::write_fmt::hee8dfd57bd179ab2
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/io/mod.rs:1763:15
   7:     0x565531b99234 - std::sys_common::backtrace::_print::h019a3cee3e814da4
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/sys_common/backtrace.rs:47:5
   8:     0x565531b99234 - std::sys_common::backtrace::print::h55694121c2ddf918
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/sys_common/backtrace.rs:34:9
   9:     0x565531b9a7b3 - std::panicking::default_hook::{{closure}}::h29cbe3da3891b0b0
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:272:22
  10:     0x565531b9a4d4 - std::panicking::default_hook::h881e76b2b8c74280
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:292:9
  11:     0x565531b9ad35 - std::panicking::rust_panic_with_hook::hcc36e25b6e33969c
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:731:13
  12:     0x565531b9ac31 - std::panicking::begin_panic_handler::{{closure}}::ha415efb0f69f41f9
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:609:13
  13:     0x565531b99976 - std::sys_common::backtrace::__rust_end_short_backtrace::h395fe90f99451e4e
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/sys_common/backtrace.rs:170:18
  14:     0x565531b9a982 - rust_begin_unwind
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:597:5
  15:     0x565531154915 - core::panicking::panic_fmt::h452a83e54ecd764e
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/panicking.rs:72:14
  16:     0x56553142c243 - wgpu::backend::direct::Context::handle_error_fatal::h56904503034c4b34
                               at /home/valentin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.18.0/src/backend/direct.rs:354:9
  17:     0x565531448a66 - <wgpu::backend::direct::Context as wgpu::context::Context>::queue_submit::h2eb4ae4d1f65332f
                               at /home/valentin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.18.0/src/backend/direct.rs:2327:25
  18:     0x565531453627 - <T as wgpu::context::DynContext>::queue_submit::h4f7aabf4d4ebdac4
                               at /home/valentin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.18.0/src/context.rs:3084:13
  19:     0x5655311f3767 - wgpu::Queue::submit::h9899a495de8bf9dd
                               at /home/valentin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.18.0/src/lib.rs:4808:27
  20:     0x56553122c086 - burn_wgpu::compute::server::WgpuServer<MM>::submit::he445d543e641ddbe
                               at /home/valentin/.cargo/git/checkouts/burn-178c6829f420dae1/1fd07fc/burn-wgpu/src/compute/server.rs:87:9
  21:     0x56553122b234 - burn_wgpu::compute::server::WgpuServer<MM>::buffer_reader::he9bc9bfa30a0e85d
                               at /home/valentin/.cargo/git/checkouts/burn-178c6829f420dae1/1fd07fc/burn-wgpu/src/compute/server.rs:207:9
  22:     0x56553122a6b0 - <burn_wgpu::compute::server::WgpuServer<MM> as burn_compute::server::ComputeServer>::read::hc40c1d17e4aff434
                               at /home/valentin/.cargo/git/checkouts/burn-178c6829f420dae1/1fd07fc/burn-wgpu/src/compute/server.rs:272:26
  23:     0x5655312382a3 - <burn_compute::channel::mutex::MutexComputeChannel<Server> as burn_compute::channel::base::ComputeChannel<Server>>::read::h0b5f35f97c1b1ae1
                               at /home/valentin/.cargo/git/checkouts/burn-178c6829f420dae1/1fd07fc/burn-compute/src/channel/mutex.rs:39:9
  24:     0x5655311a679a - burn_compute::client::ComputeClient<Server,Channel>::read::he9387d7a82e8ac07
                               at /home/valentin/.cargo/git/checkouts/burn-178c6829f420dae1/1fd07fc/burn-compute/src/client.rs:51:9
  25:     0x565531209173 - burn_wgpu::ops::base::into_data::h5320b551a2432787
                               at /home/valentin/.cargo/git/checkouts/burn-178c6829f420dae1/1fd07fc/burn-wgpu/src/ops/base.rs:20:5
  26:     0x5655311bc96d - burn_wgpu::ops::float_ops::<impl burn_tensor::tensor::ops::tensor::TensorOps<burn_wgpu::backend::Wgpu<G,F,I>> for burn_wgpu::backend::Wgpu<G,F,I>>::into_data::h38a0a7436dc445bb
                               at /home/valentin/.cargo/git/checkouts/burn-178c6829f420dae1/1fd07fc/burn-wgpu/src/ops/float_ops.rs:59:9
  27:     0x5655311dcdcd - burn_autodiff::ops::tensor::<impl burn_tensor::tensor::ops::tensor::TensorOps<burn_autodiff::backend::Autodiff<B>> for burn_autodiff::backend::Autodiff<B>>::into_data::h74b3215ce9b8cefb
                               at /home/valentin/.cargo/git/checkouts/burn-178c6829f420dae1/1fd07fc/burn-autodiff/src/ops/tensor.rs:53:9
  28:     0x56553124d48d - <burn_tensor::tensor::api::kind::Float as burn_tensor::tensor::api::base::BasicOps<B>>::into_data::hf9fbeb37dc7db254
                               at /home/valentin/.cargo/git/checkouts/burn-178c6829f420dae1/1fd07fc/burn-tensor/src/tensor/api/base.rs:1188:9
  29:     0x5655311e8bcf - burn_tensor::tensor::api::base::Tensor<B,_,K>::into_data::hbc4c9c22a665eaae
                               at /home/valentin/.cargo/git/checkouts/burn-178c6829f420dae1/1fd07fc/burn-tensor/src/tensor/api/base.rs:397:9
  30:     0x5655311ea047 - burn_tensor::tensor::api::numeric::<impl burn_tensor::tensor::api::base::Tensor<B,_,K>>::into_scalar::h5b9cc6b8d61f86c6
                               at /home/valentin/.cargo/git/checkouts/burn-178c6829f420dae1/1fd07fc/burn-tensor/src/tensor/api/numeric.rs:20:20
  31:     0x5655311b7bf8 - burn_wgpu_yolov2_mve::main::hf200775014db5aeb
                               at /home/valentin/Desktop/GEOMAR/burn-wgpu-yolov2-mve/src/main.rs:37:123
  32:     0x5655311f7fdb - core::ops::function::FnOnce::call_once::h95708e1dc9599e47
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/ops/function.rs:250:5
  33:     0x56553124e7ce - std::sys_common::backtrace::__rust_begin_short_backtrace::hcb4c88e875e70d6b
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/sys_common/backtrace.rs:154:18
  34:     0x565531209ae1 - std::rt::lang_start::{{closure}}::h7ed3af82f9047e2b
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/rt.rs:166:18
  35:     0x565531b93dab - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h14c5f6d1cd70a60f
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/ops/function.rs:284:13
  36:     0x565531b93dab - std::panicking::try::do_call::h2d02374ca451446a
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:504:40
  37:     0x565531b93dab - std::panicking::try::h9f7922394bf57392
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:468:19
  38:     0x565531b93dab - std::panic::catch_unwind::ha1600f9dd4ee7270
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panic.rs:142:14
  39:     0x565531b93dab - std::rt::lang_start_internal::{{closure}}::hfbd80e7d681b21a1
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/rt.rs:148:48
  40:     0x565531b93dab - std::panicking::try::do_call::heacaa33dbdaa16e0
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:504:40
  41:     0x565531b93dab - std::panicking::try::h637875f7c9db85ea
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:468:19
  42:     0x565531b93dab - std::panic::catch_unwind::h4caa9c0c78cb4c19
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panic.rs:142:14
  43:     0x565531b93dab - std::rt::lang_start_internal::h2d6a60ec944b523d
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/rt.rs:148:20
  44:     0x565531209aba - std::rt::lang_start::hfb8b486eaef989a8
                               at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/rt.rs:165:17
  45:     0x5655311b809e - main
  46:     0x7f7d5c0280d0 - __libc_start_call_main
                               at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
  47:     0x7f7d5c028189 - __libc_start_main_impl
                               at ./csu/../csu/libc-start.c:360:3
  48:     0x5655311551e5 - _start
  49:                0x0 - <unknown>

To Reproduce

Steps to reproduce the behavior:

  1. Clone the project from https://github.com/apexys/burn-wgpu-yolov2-mve
  2. Run RUST_BACKTRACE="full" cargo run &> test.log
  3. Batch sizes 1 and 4 work, batch size 8 crashes
  4. Try reducing the number of parallel tasks with BURN_WGPU_MAX_TASKS=1 RUST_BACKTRACE="full" cargo run &> test.log
  5. Batch size 8 also crashes
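The steps above, as a single shell sequence (the crash itself needs a GPU and depends on the driver, so it won't necessarily reproduce on every machine):

```shell
# Clone the minimal reproduction and run it with full backtraces,
# redirecting both stdout and stderr into a log file.
git clone https://github.com/apexys/burn-wgpu-yolov2-mve
cd burn-wgpu-yolov2-mve
RUST_BACKTRACE=full cargo run &> test.log

# Same run, but capping the number of parallel wgpu tasks
# (the batch-size-8 crash still occurs with this setting):
BURN_WGPU_MAX_TASKS=1 RUST_BACKTRACE=full cargo run &> test.log
```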

Expected behavior

The network runs correctly regardless of batch size.

Desktop (please complete the following information):

  • OS: Ubuntu 23.10, Nvidia GTX 1060 6GB
  • Version: (current master @ 1fd07fcb)

Other desktop:

  • OS: Windows 10, AMD RX6900XT 16GB
  • Version: (current master @ 1fd07fcb)

Additional context

I don't think this is a memory issue: the network uses less than 4 GB of GPU RAM according to nvidia-smi. On the AMD machine I can even go up to batch size 32, with around 6 GB of GPU RAM used, but it still crashes when I increase the batch size further. Do I need to manually sync somewhere? Can I disable autotune somehow?

apexys avatar Dec 21 '23 11:12 apexys

@apexys One of wgpu's limitations is that there is a cap on the amount of memory per shader, unrelated to the amount of memory available on the GPU, so that might be the problem in this scenario. The cap isn't the same on all machines, and probably differs between graphics drivers.

What you might try is disabling Fusion, since we may merge many shaders into one and hit that cap. On our side, we need to improve the error messages and make our Fusion approach consider the memory limit of the current GPU.

Let me know if it helps.
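(For context, "disabling Fusion" usually amounts to not wrapping the WGPU backend in the `Fusion` decorator. A rough sketch, assuming the project follows Burn's usual type-alias pattern; the `AutoGraphicsApi`/`f32`/`i32` parameters mirror the `Wgpu<G, F, I>` generics visible in the backtrace, and exact paths and feature names depend on the pinned Burn revision, so treat the names as illustrative:)

```rust
// Sketch only. With Fusion enabled, the backend type is typically
// wrapped in the `Fusion` decorator:
//
//     type Backend = Autodiff<Fusion<Wgpu<AutoGraphicsApi, f32, i32>>>;
//
// Disabling Fusion means using the plain WGPU backend instead:
//
//     type Backend = Autodiff<Wgpu<AutoGraphicsApi, f32, i32>>;
//
// and, if Fusion is pulled in via a Cargo feature, dropping that feature
// from the dependency declaration:
//
//     burn = { git = "...", features = ["wgpu", "autodiff"] }  # no "fusion"
```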

nathanielsimard avatar Dec 21 '23 17:12 nathanielsimard

Thanks for the quick reply! I tried running without Fusion and it got exactly one iteration further (it now crashes on the second batch-size-8 epoch instead of the first). I don't suppose I can edit this memory limit anywhere?

apexys avatar Dec 21 '23 17:12 apexys

It might also very well be a Vulkan driver problem: I managed to completely crash my Wayland session several times while running this code. Not really what I'd expect from user-space code, but that doesn't seem like a Burn problem.

apexys avatar Dec 21 '23 17:12 apexys

I don't think you can change that value manually. Though if you find a way, let us know; we could do that automatically!

nathanielsimard avatar Dec 21 '23 17:12 nathanielsimard