Queue::submit: Validation Error caused by Parent device is lost
Describe the bug
I'm trying to implement the YOLO9000/YOLOv2 network in Burn with the WGPU backend. Creating and running the network works, but I can't use batch sizes of 8 or larger. When I do, the thread panics with the message:
Finished dev [unoptimized + debuginfo] target(s) in 0.12s
Running `target/debug/burn-wgpu-yolov2-mve`
Model parameter count: 21150400
Testing batch size 1
Batch output shape: Shape { dims: [1, 1280] }
Warmup epoch took 2186.857ms
Epoch 1 took 1015.232ms and reached loss 6400.792
Epoch 2 took 1033.960ms and reached loss 55.710
Epoch 3 took 1006.282ms and reached loss 128.202
Epoch 4 took 1004.970ms and reached loss 213.439
Average time per batch at batch_size=1: 1015.109ms
Testing batch size 4
Batch output shape: Shape { dims: [4, 1280] }
Warmup epoch took 16309.362ms
Epoch 1 took 3608.391ms and reached loss 74.053
Epoch 2 took 3622.630ms and reached loss 31.859
Epoch 3 took 3617.320ms and reached loss 7.854
Epoch 4 took 3625.252ms and reached loss 1.124
Average time per batch at batch_size=4: 3618.397ms
Testing batch size 8
Batch output shape: Shape { dims: [8, 1280] }
Warmup epoch took 22324.035ms
Epoch 1 took 5874.058ms and reached loss 0.000
thread 'main' panicked at /home/valentin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.18.0/src/backend/direct.rs:2327:30:
Error in Queue::submit: Validation Error
Caused by:
Parent device is lost
stack backtrace:
0: 0x565531b9944c - std::backtrace_rs::backtrace::libunwind::trace::ha69d38c49f1bf263
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
1: 0x565531b9944c - std::backtrace_rs::backtrace::trace_unsynchronized::h93125d0b85fd543c
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: 0x565531b9944c - std::sys_common::backtrace::_print_fmt::h8d65f438e8343444
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/sys_common/backtrace.rs:67:5
3: 0x565531b9944c - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h41751d2af6c8033a
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/sys_common/backtrace.rs:44:22
4: 0x565531bc0abc - core::fmt::rt::Argument::fmt::h5db2f552d8a28f63
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/fmt/rt.rs:138:9
5: 0x565531bc0abc - core::fmt::write::h99465148a27e4883
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/fmt/mod.rs:1114:21
6: 0x565531b96eee - std::io::Write::write_fmt::hee8dfd57bd179ab2
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/io/mod.rs:1763:15
7: 0x565531b99234 - std::sys_common::backtrace::_print::h019a3cee3e814da4
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/sys_common/backtrace.rs:47:5
8: 0x565531b99234 - std::sys_common::backtrace::print::h55694121c2ddf918
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/sys_common/backtrace.rs:34:9
9: 0x565531b9a7b3 - std::panicking::default_hook::{{closure}}::h29cbe3da3891b0b0
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:272:22
10: 0x565531b9a4d4 - std::panicking::default_hook::h881e76b2b8c74280
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:292:9
11: 0x565531b9ad35 - std::panicking::rust_panic_with_hook::hcc36e25b6e33969c
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:731:13
12: 0x565531b9ac31 - std::panicking::begin_panic_handler::{{closure}}::ha415efb0f69f41f9
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:609:13
13: 0x565531b99976 - std::sys_common::backtrace::__rust_end_short_backtrace::h395fe90f99451e4e
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/sys_common/backtrace.rs:170:18
14: 0x565531b9a982 - rust_begin_unwind
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:597:5
15: 0x565531154915 - core::panicking::panic_fmt::h452a83e54ecd764e
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/panicking.rs:72:14
16: 0x56553142c243 - wgpu::backend::direct::Context::handle_error_fatal::h56904503034c4b34
at /home/valentin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.18.0/src/backend/direct.rs:354:9
17: 0x565531448a66 - <wgpu::backend::direct::Context as wgpu::context::Context>::queue_submit::h2eb4ae4d1f65332f
at /home/valentin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.18.0/src/backend/direct.rs:2327:25
18: 0x565531453627 - <T as wgpu::context::DynContext>::queue_submit::h4f7aabf4d4ebdac4
at /home/valentin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.18.0/src/context.rs:3084:13
19: 0x5655311f3767 - wgpu::Queue::submit::h9899a495de8bf9dd
at /home/valentin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.18.0/src/lib.rs:4808:27
20: 0x56553122c086 - burn_wgpu::compute::server::WgpuServer<MM>::submit::he445d543e641ddbe
at /home/valentin/.cargo/git/checkouts/burn-178c6829f420dae1/1fd07fc/burn-wgpu/src/compute/server.rs:87:9
21: 0x56553122b234 - burn_wgpu::compute::server::WgpuServer<MM>::buffer_reader::he9bc9bfa30a0e85d
at /home/valentin/.cargo/git/checkouts/burn-178c6829f420dae1/1fd07fc/burn-wgpu/src/compute/server.rs:207:9
22: 0x56553122a6b0 - <burn_wgpu::compute::server::WgpuServer<MM> as burn_compute::server::ComputeServer>::read::hc40c1d17e4aff434
at /home/valentin/.cargo/git/checkouts/burn-178c6829f420dae1/1fd07fc/burn-wgpu/src/compute/server.rs:272:26
23: 0x5655312382a3 - <burn_compute::channel::mutex::MutexComputeChannel<Server> as burn_compute::channel::base::ComputeChannel<Server>>::read::h0b5f35f97c1b1ae1
at /home/valentin/.cargo/git/checkouts/burn-178c6829f420dae1/1fd07fc/burn-compute/src/channel/mutex.rs:39:9
24: 0x5655311a679a - burn_compute::client::ComputeClient<Server,Channel>::read::he9387d7a82e8ac07
at /home/valentin/.cargo/git/checkouts/burn-178c6829f420dae1/1fd07fc/burn-compute/src/client.rs:51:9
25: 0x565531209173 - burn_wgpu::ops::base::into_data::h5320b551a2432787
at /home/valentin/.cargo/git/checkouts/burn-178c6829f420dae1/1fd07fc/burn-wgpu/src/ops/base.rs:20:5
26: 0x5655311bc96d - burn_wgpu::ops::float_ops::<impl burn_tensor::tensor::ops::tensor::TensorOps<burn_wgpu::backend::Wgpu<G,F,I>> for burn_wgpu::backend::Wgpu<G,F,I>>::into_data::h38a0a7436dc445bb
at /home/valentin/.cargo/git/checkouts/burn-178c6829f420dae1/1fd07fc/burn-wgpu/src/ops/float_ops.rs:59:9
27: 0x5655311dcdcd - burn_autodiff::ops::tensor::<impl burn_tensor::tensor::ops::tensor::TensorOps<burn_autodiff::backend::Autodiff<B>> for burn_autodiff::backend::Autodiff<B>>::into_data::h74b3215ce9b8cefb
at /home/valentin/.cargo/git/checkouts/burn-178c6829f420dae1/1fd07fc/burn-autodiff/src/ops/tensor.rs:53:9
28: 0x56553124d48d - <burn_tensor::tensor::api::kind::Float as burn_tensor::tensor::api::base::BasicOps<B>>::into_data::hf9fbeb37dc7db254
at /home/valentin/.cargo/git/checkouts/burn-178c6829f420dae1/1fd07fc/burn-tensor/src/tensor/api/base.rs:1188:9
29: 0x5655311e8bcf - burn_tensor::tensor::api::base::Tensor<B,_,K>::into_data::hbc4c9c22a665eaae
at /home/valentin/.cargo/git/checkouts/burn-178c6829f420dae1/1fd07fc/burn-tensor/src/tensor/api/base.rs:397:9
30: 0x5655311ea047 - burn_tensor::tensor::api::numeric::<impl burn_tensor::tensor::api::base::Tensor<B,_,K>>::into_scalar::h5b9cc6b8d61f86c6
at /home/valentin/.cargo/git/checkouts/burn-178c6829f420dae1/1fd07fc/burn-tensor/src/tensor/api/numeric.rs:20:20
31: 0x5655311b7bf8 - burn_wgpu_yolov2_mve::main::hf200775014db5aeb
at /home/valentin/Desktop/GEOMAR/burn-wgpu-yolov2-mve/src/main.rs:37:123
32: 0x5655311f7fdb - core::ops::function::FnOnce::call_once::h95708e1dc9599e47
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/ops/function.rs:250:5
33: 0x56553124e7ce - std::sys_common::backtrace::__rust_begin_short_backtrace::hcb4c88e875e70d6b
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/sys_common/backtrace.rs:154:18
34: 0x565531209ae1 - std::rt::lang_start::{{closure}}::h7ed3af82f9047e2b
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/rt.rs:166:18
35: 0x565531b93dab - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h14c5f6d1cd70a60f
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/ops/function.rs:284:13
36: 0x565531b93dab - std::panicking::try::do_call::h2d02374ca451446a
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:504:40
37: 0x565531b93dab - std::panicking::try::h9f7922394bf57392
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:468:19
38: 0x565531b93dab - std::panic::catch_unwind::ha1600f9dd4ee7270
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panic.rs:142:14
39: 0x565531b93dab - std::rt::lang_start_internal::{{closure}}::hfbd80e7d681b21a1
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/rt.rs:148:48
40: 0x565531b93dab - std::panicking::try::do_call::heacaa33dbdaa16e0
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:504:40
41: 0x565531b93dab - std::panicking::try::h637875f7c9db85ea
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:468:19
42: 0x565531b93dab - std::panic::catch_unwind::h4caa9c0c78cb4c19
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panic.rs:142:14
43: 0x565531b93dab - std::rt::lang_start_internal::h2d6a60ec944b523d
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/rt.rs:148:20
44: 0x565531209aba - std::rt::lang_start::hfb8b486eaef989a8
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/rt.rs:165:17
45: 0x5655311b809e - main
46: 0x7f7d5c0280d0 - __libc_start_call_main
at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
47: 0x7f7d5c028189 - __libc_start_main_impl
at ./csu/../csu/libc-start.c:360:3
48: 0x5655311551e5 - _start
49: 0x0 - <unknown>
To Reproduce
Steps to reproduce the behavior:
- Clone the project from https://github.com/apexys/burn-wgpu-yolov2-mve
- Run RUST_BACKTRACE="full" cargo run &> test.log
- Batch sizes 1 and 4 work, batch size 8 crashes
- Try reducing the number of parallel tasks with BURN_WGPU_MAX_TASKS=1 RUST_BACKTRACE="full" cargo run &> test.log
- Batch size 8 also crashes
Expected behavior
The network works regardless of the batch size.
Desktop (please complete the following information):
- OS: Ubuntu 23.10, Nvidia GTX 1060 6GB
- Version: (current master @ 1fd07fcb)
Other desktop:
- OS: Windows 10, AMD RX6900XT 16GB
- Version: (current master @ 1fd07fcb)
Additional context
I don't think this is a memory issue: the network uses less than 4 GB of GPU RAM according to nvidia-smi. On the AMD computer I can go up to batch size 32 with around 6 GB of GPU RAM in use, but it still crashes when I increase the batch size further. Do I need to sync manually somewhere? Can I disable autotune somehow?
@apexys One limitation of wgpu is that the amount of memory a single shader can use is capped, independently of the total memory available on the GPU, so that might be the problem here. The limit isn't the same on every computer, and it probably differs between graphics drivers as well.
You could try disabling Fusion, since it may merge many shaders into one and thereby reach the limit. We need to improve our error messages and take the memory limit of the current GPU into account in our Fusion approach.
Let me know if it helps.
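To make the point about a per-shader limit a bit more concrete, here is a rough sketch of the arithmetic, not a diagnosis. It assumes the limit in question behaves like WebGPU's default maxStorageBufferBindingSize of 128 MiB, and it uses a made-up activation shape of [batch, 32, 416, 416] that is not taken from the linked repository. The function names are hypothetical too. Whether Burn requests the default limits or the adapter's own limits isn't confirmed anywhere in this thread, and real devices often report higher values.

```rust
/// WebGPU's default `maxStorageBufferBindingSize` (128 MiB).
/// Assumption: the device was created with default limits; adapters may allow more.
const DEFAULT_MAX_STORAGE_BINDING_BYTES: u64 = 128 * 1024 * 1024;

/// Size in bytes of a dense f32 tensor with the given dimensions.
fn f32_tensor_bytes(dims: &[u64]) -> u64 {
    dims.iter().product::<u64>() * std::mem::size_of::<f32>() as u64
}

/// True if one tensor of this shape fits into a single storage-buffer binding.
fn fits_in_binding(dims: &[u64], limit: u64) -> bool {
    f32_tensor_bytes(dims) <= limit
}

fn main() {
    // Hypothetical activation shape: [batch, channels, height, width].
    for batch in [1u64, 4, 8, 32] {
        let dims = [batch, 32, 416, 416];
        println!(
            "batch {batch}: {} bytes, fits in one binding: {}",
            f32_tensor_bytes(&dims),
            fits_in_binding(&dims, DEFAULT_MAX_STORAGE_BINDING_BYTES)
        );
    }
}
```

Under these assumptions a single intermediate tensor can exceed the binding limit even when total GPU memory usage is only a few GB, which would explain why nvidia-smi doesn't show the card as full.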
Thanks for the quick reply! I tried running without Fusion and it got exactly one iteration further (so it crashes on the second round of batch size 8 instead of the first). I don't suppose I can change this memory limit anywhere?
It might also very well be a Vulkan driver problem: I managed to completely crash my Wayland session several times running this code. That's not what I expected from user-space code, but it doesn't seem to be a problem with Burn.
I don't think you can change that value manually. Though, if you find a way, let us know, we could do that automatically!