wgpu tracing regression in 0.20.0 vs 0.19.4

Description

Attempting to use copy_buffer_to_buffer in 0.20.0 crashes with:

thread 'main' panicked at /home/damocles/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-core-0.20.0/src/resource.rs:121:17:
called `Option::unwrap()` on a `None` value

The full backtrace from one of my test runs is:

stack backtrace:
   0:     0x5597354f12d2 - std::backtrace_rs::backtrace::libunwind::trace::he4ee80166a02c846
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/../../backtrace/src/backtrace/libunwind.rs:105:5
   1:     0x5597354f12d2 - std::backtrace_rs::backtrace::trace_unsynchronized::h476faccf57e88641
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x5597354f12d2 - std::sys_common::backtrace::_print_fmt::h430c922a77e7a59c
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:68:5
   3:     0x5597354f12d2 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hffecb437d922f988
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x55973551660c - core::fmt::rt::Argument::fmt::hf3df69369399bfa9
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/fmt/rt.rs:142:9
   5:     0x55973551660c - core::fmt::write::hd9a8d7d029f9ea1a
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/fmt/mod.rs:1153:17
   6:     0x5597354ef14f - std::io::Write::write_fmt::h0e1226b2b8d973fe
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/io/mod.rs:1843:15
   7:     0x5597354f10a4 - std::sys_common::backtrace::_print::hd2df4a083f6e69b8
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:47:5
   8:     0x5597354f10a4 - std::sys_common::backtrace::print::he907f6ad7eee41cb
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:34:9
   9:     0x5597354f255b - std::panicking::default_hook::{{closure}}::h3926193b61c9ca9b
  10:     0x5597354f22b3 - std::panicking::default_hook::h25ba2457dea68e65
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:292:9
  11:     0x5597354f29fd - std::panicking::rust_panic_with_hook::h0ad14d90dcf5224f
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:779:13
  12:     0x5597354f2899 - std::panicking::begin_panic_handler::{{closure}}::h4a1838a06f542647
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:649:13
  13:     0x5597354f17a6 - std::sys_common::backtrace::__rust_end_short_backtrace::h77cc4dc3567ca904
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:171:18
  14:     0x5597354f2604 - rust_begin_unwind
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:645:5
  15:     0x5597348934e5 - core::panicking::panic_fmt::h940d4fd01a4b4fd1
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/panicking.rs:72:14
  16:     0x5597348935a3 - core::panicking::panic::h8ddd58dc57c2dc00
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/panicking.rs:145:5
  17:     0x559734893486 - core::option::unwrap_failed::hf59153bb1e2fc334
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/option.rs:1985:5
  18:     0x559734e379a0 - core::option::Option<T>::unwrap::hdeb99919510551b3
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/option.rs:933:21
  19:     0x559734e379a0 - wgpu_core::resource::ResourceInfo<T>::id::he0c6517bd8e3f91d
                               at /home/damocles/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-core-0.20.0/src/resource.rs:121:9
  20:     0x559734e38def - <wgpu_core::resource::Buffer<A> as core::ops::drop::Drop>::drop::h8fe7e4be1a0f6653
                               at /home/damocles/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-core-0.20.0/src/resource.rs:404:52
  21:     0x559734de4da7 - core::ptr::drop_in_place<wgpu_core::resource::Buffer<wgpu_hal::vulkan::Api>>::ha1c38d8abfc5c79c
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  22:     0x559734eaf2ff - alloc::sync::Arc<T,A>::drop_slow::h52b7243041689c9a
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/alloc/src/sync.rs:1804:18
  23:     0x559734eb3232 - <alloc::sync::Arc<T,A> as core::ops::drop::Drop>::drop::ha775173f5482ce52
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/alloc/src/sync.rs:2459:13
  24:     0x559734dcd4bb - core::ptr::drop_in_place<alloc::sync::Arc<wgpu_core::resource::Buffer<wgpu_hal::vulkan::Api>>>::h50a1ca1948d557f6
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  25:     0x559734dd943f - core::ptr::drop_in_place<(wgpu_core::track::TrackerIndex,alloc::sync::Arc<wgpu_core::resource::Buffer<wgpu_hal::vulkan::Api>>)>::hcb77ccecad10dfdc
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  26:     0x559734c80e72 - core::ptr::mut_ptr::<impl *mut T>::drop_in_place::hf563684f0ac0247e
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mut_ptr.rs:1473:18
  27:     0x559734c80e72 - hashbrown::raw::Bucket<T>::drop::h82efac29a399c8c8
                               at /rust/deps/hashbrown-0.14.3/src/raw/mod.rs:590:23
  28:     0x559734c78498 - hashbrown::raw::RawTableInner::drop_elements::hc20a52aa5ac43ff3
                               at /rust/deps/hashbrown-0.14.3/src/raw/mod.rs:2379:17
  29:     0x559734c79b80 - hashbrown::raw::RawTableInner::drop_inner_table::h322098f924a4e6c2
                               at /rust/deps/hashbrown-0.14.3/src/raw/mod.rs:2434:17
  30:     0x559734c747fa - <hashbrown::raw::RawTable<T,A> as core::ops::drop::Drop>::drop::h8c81f5c568abf00a
                               at /rust/deps/hashbrown-0.14.3/src/raw/mod.rs:3678:13
  31:     0x559734ddca5b - core::ptr::drop_in_place<hashbrown::raw::RawTable<(wgpu_core::track::TrackerIndex,alloc::sync::Arc<wgpu_core::resource::Buffer<wgpu_hal::vulkan::Api>>)>>::h848d236953d9150c
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  32:     0x559734ddeddb - core::ptr::drop_in_place<hashbrown::map::HashMap<wgpu_core::track::TrackerIndex,alloc::sync::Arc<wgpu_core::resource::Buffer<wgpu_hal::vulkan::Api>>,core::hash::BuildHasherDefault<rustc_hash::FxHasher>>>::h268807d3d564515d
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  33:     0x559734ddf22b - core::ptr::drop_in_place<std::collections::hash::map::HashMap<wgpu_core::track::TrackerIndex,alloc::sync::Arc<wgpu_core::resource::Buffer<wgpu_hal::vulkan::Api>>,core::hash::BuildHasherDefault<rustc_hash::FxHasher>>>::h109c75fb36db2a51
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  34:     0x559734de8fe7 - core::ptr::drop_in_place<wgpu_core::device::life::ResourceMaps<wgpu_hal::vulkan::Api>>::h1e5585fb807c2963
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  35:     0x559734d671bd - wgpu_core::device::life::LifetimeTracker<A>::triage_submissions::h9b06f4999f3535f0
                               at /home/damocles/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-core-0.20.0/src/device/life.rs:413:9
  36:     0x559734e285a8 - wgpu_core::device::resource::Device<A>::maintain::ha6ad7b20c3de2f07
                               at /home/damocles/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-core-0.20.0/src/device/resource.rs:434:13
  37:     0x559734b8ddd8 - wgpu_core::device::queue::<impl wgpu_core::global::Global>::queue_submit::h95b283a3c900a2eb
                               at /home/damocles/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-core-0.20.0/src/device/queue.rs:1555:39
  38:     0x559734b4c13e - <wgpu::backend::wgpu_core::ContextWgpuCore as wgpu::context::Context>::queue_submit::he97dd3b53d285061
                               at /home/damocles/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.20.0/src/backend/wgpu_core.rs:2260:27
  39:     0x559734b57423 - <T as wgpu::context::DynContext>::queue_submit::hb2174c2d1030c8b5
                               at /home/damocles/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.20.0/src/context.rs:3025:13
  40:     0x5597348a09e5 - wgpu::Queue::submit::hfab63398e5d30aab
                               at /home/damocles/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.20.0/src/lib.rs:4981:27
  41:     0x5597348a4980 - alan_generated_bin::read_buffer::hbedf71215769391d
                               at /home/damocles/.config/alan/alan_generated_bin/src/main.rs:266:5
  42:     0x5597348a50b8 - alan_generated_bin::main::h1e3a29ab7fcab87a
                               at /home/damocles/.config/alan/alan_generated_bin/src/main.rs:309:20
  43:     0x559734894f6b - core::ops::function::FnOnce::call_once::h5fce3699794672b3
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5
  44:     0x55973489c71e - std::sys_common::backtrace::__rust_begin_short_backtrace::h6179380494bff8d2
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:155:18
  45:     0x5597348a1441 - std::rt::lang_start::{{closure}}::hf0241278dd9be494
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/rt.rs:166:18
  46:     0x5597354eb253 - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h52f5991f9ab8b369
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:284:13
  47:     0x5597354eb253 - std::panicking::try::do_call::h0ac4bee9a397a1bf
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:552:40
  48:     0x5597354eb253 - std::panicking::try::hc005decaf198d0ed
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:516:19
  49:     0x5597354eb253 - std::panic::catch_unwind::hb0f967d870b2a382
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panic.rs:146:14
  50:     0x5597354eb253 - std::rt::lang_start_internal::{{closure}}::hd140b84b0efe534b
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/rt.rs:148:48
  51:     0x5597354eb253 - std::panicking::try::do_call::h1ddfaf1d0d576c38
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:552:40
  52:     0x5597354eb253 - std::panicking::try::hdd4bdf855547659f
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:516:19
  53:     0x5597354eb253 - std::panic::catch_unwind::h276ba91c7706110c
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panic.rs:146:14
  54:     0x5597354eb253 - std::rt::lang_start_internal::h103c42a9c4e95084
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/rt.rs:148:20
  55:     0x5597348a141a - std::rt::lang_start::hce91f7cfea2f3ec4
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/rt.rs:165:17
  56:     0x5597348a536e - main
  57:     0x7f6bdeeea088 - __libc_start_call_main
  58:     0x7f6bdeeea14b - __libc_start_main_impl
  59:     0x559734893c85 - _start
  60:                0x0 - <unknown>

Somehow, something internal to wgpu doesn't have an ID. I temporarily added #[derive(Debug)] to my own structures and debug-logged the buffers I'm passing to copy_buffer_to_buffer, and they all had IDs, so I'm not sure what exactly is going on. Looking a bit higher up the stack, it seems to be related to the automatic GPU resource cleanup logic in 0.20.0, though I don't understand why it would be triggered.

Repro steps

I put a trimmed version of the code in a gist; you just need to copy the main.rs file to src/main.rs in a normal Rust project to test it.

Expected vs observed behavior

This code (with minor modifications to remove the compilation_options field from the ComputePipelineDescriptor) compiles and runs successfully on 0.19.4, but crashes on 0.20.0.

Extra materials

I include the trace.zip it generated.

Platform

I've tested this on Fedora/x86-64 and Debian/RISC-V with the same results; only wgpu version 0.20.0 is affected.

Jun 10 '24 15:06 dfellis

I edited my local cargo cache to insert a debug log on the buffer whose freeing triggers the crash, which you can see below (with some hand formatting for better legibility):

Buffer {
  raw: <snatchable>,
  device: Device {
    adapter: "<Adapter>",
    limits: Limits {
       max_texture_dimension_1d: 16384,
       max_texture_dimension_2d: 16384,
       max_texture_dimension_3d: 2048,
       max_texture_array_layers: 2048,
       max_bind_groups: 8,
       max_bindings_per_bind_group: 1000,
       max_dynamic_uniform_buffers_per_pipeline_layout: 16,
       max_dynamic_storage_buffers_per_pipeline_layout: 8,
       max_sampled_textures_per_shader_stage: 8388606,
       max_samplers_per_shader_stage: 8388606,
       max_storage_buffers_per_shader_stage: 8388606,
       max_storage_textures_per_shader_stage: 8388606,
       max_uniform_buffers_per_shader_stage: 8388606,
       max_uniform_buffer_binding_size: 2147483648,
       max_storage_buffer_binding_size: 2147483648,
       max_vertex_buffers: 16,
       max_buffer_size: 2147483647,
       max_vertex_attributes: 32,
       max_vertex_buffer_array_stride: 2048,
       min_uniform_buffer_offset_alignment: 32,
       min_storage_buffer_offset_alignment: 32,
       max_inter_stage_shader_components: 128,
       max_color_attachments: 8,
       max_color_attachment_bytes_per_sample: 32,
       max_compute_workgroup_storage_size: 65536,
       max_compute_invocations_per_workgroup: 1024,
       max_compute_workgroup_size_x: 1024,
       max_compute_workgroup_size_y: 1024,
       max_compute_workgroup_size_z: 1024,
       max_compute_workgroups_per_dimension: 65535,
       min_subgroup_size: 64,
       max_subgroup_size: 64,
       max_push_constant_size: 256,
       max_non_sampler_bindings: 4294967295
    },
    features: Features(DEPTH_CLIP_CONTROL | DEPTH32FLOAT_STENCIL8 | TEXTURE_COMPRESSION_BC | TIMESTAMP_QUERY | INDIRECT_FIRST_INSTANCE | SHADER_F16 | RG11B10UFLOAT_RENDERABLE | BGRA8UNORM_STORAGE | FLOAT32_FILTERABLE | TEXTURE_FORMAT_16BIT_NORM | TEXTURE_ADAPTER_SPECIFIC_FORMAT_FEATURES | PIPELINE_STATISTICS_QUERY | TIMESTAMP_QUERY_INSIDE_ENCODERS | TIMESTAMP_QUERY_INSIDE_PASSES | MAPPABLE_PRIMARY_BUFFERS | TEXTURE_BINDING_ARRAY | BUFFER_BINDING_ARRAY | STORAGE_RESOURCE_BINDING_ARRAY | SAMPLED_TEXTURE_AND_STORAGE_BUFFER_ARRAY_NON_UNIFORM_INDEXING | UNIFORM_BUFFER_AND_STORAGE_TEXTURE_ARRAY_NON_UNIFORM_INDEXING | PARTIALLY_BOUND_BINDING_ARRAY | MULTI_DRAW_INDIRECT | MULTI_DRAW_INDIRECT_COUNT | PUSH_CONSTANTS | ADDRESS_MODE_CLAMP_TO_ZERO | ADDRESS_MODE_CLAMP_TO_BORDER | POLYGON_MODE_LINE | POLYGON_MODE_POINT | CONSERVATIVE_RASTERIZATION | VERTEX_WRITABLE_STORAGE | CLEAR_TEXTURE | SPIRV_SHADER_PASSTHROUGH | MULTIVIEW | SHADER_UNUSED_VERTEX_OUTPUT | TEXTURE_FORMAT_NV12 | SHADER_F64 | SHADER_I16 | SHADER_PRIMITIVE_INDEX | DUAL_SOURCE_BLENDING | SHADER_INT64 | SUBGROUP | SUBGROUP_VERTEX | SUBGROUP_BARRIER),
    downlevel: DownlevelCapabilities {
      flags: DownlevelFlags(COMPUTE_SHADERS | FRAGMENT_WRITABLE_STORAGE | INDIRECT_EXECUTION | BASE_VERTEX | READ_ONLY_DEPTH_STENCIL | NON_POWER_OF_TWO_MIPMAPPED_TEXTURES | CUBE_ARRAY_TEXTURES | COMPARISON_SAMPLERS | INDEPENDENT_BLEND | VERTEX_STORAGE | ANISOTROPIC_FILTERING | FRAGMENT_STORAGE | MULTISAMPLED_SHADING | DEPTH_TEXTURE_AND_BUFFER_COPIES | WEBGPU_TEXTURE_FORMAT_SUPPORT | BUFFER_BINDINGS_NOT_16_BYTE_ALIGNED | UNRESTRICTED_INDEX_BUFFER | FULL_DRAW_INDEX_UINT32 | DEPTH_BIAS_CLAMP | VIEW_FORMATS | UNRESTRICTED_EXTERNAL_TEXTURE_COPIES | SURFACE_VIEW_FORMATS | NONBLOCKING_QUERY_RESOLVE | VERTEX_AND_INSTANCE_INDEX_RESPECTS_RESPECTIVE_FIRST_VALUE_IN_INDIRECT_DRAW),
      limits: DownlevelLimits,
      shader_model: Sm5
    }
  },
  usage: BufferUsages(MAP_WRITE | COPY_SRC),
  size: 16,
  initialization_status: RwLock { data: InitTracker { uninitialized_ranges: [] } },
  sync_mapped_writes: Mutex { data: None },
  info: ResourceInfo {
    id: None,
    tracker_index: TrackerIndex(1),
    tracker_indices: Some(SharedTrackerIndexAllocator { inner: Mutex { data:  } }),
    submission_index: 0,
    label: "(wgpu internal) initializing unmappable buffer"
  },
  map_state: Mutex { data: Idle },
  bind_groups: Mutex { data: [] }
}

I don't create a buffer with MAP_WRITE | COPY_SRC flags set, and the label "(wgpu internal)..." indicates this is probably something internal to the copy_buffer_to_buffer function. I still don't know how it has no ID, though.

Jun 10 '24 16:06 dfellis

So only one place creates a label with that name: device_create_buffer in wgpu_core/src/device/global.rs.

Some debug logging on the args there reveals:

desc: BufferDescriptor { label: None, size: 16, usage: BufferUsages(COPY_SRC | COPY_DST | STORAGE), mapped_at_creation: true }

The described buffer to create is supposedly the buffer I'm copying from, but by this point in the trace, that buffer should already exist.

But if I slap a seemingly useless MAP_WRITE onto that buffer, avoiding whatever this temporary buffer is, the code compiles and runs on 0.20.0.

So I think that's the end of my bug report for now, as I don't understand why this temporary buffer is needed when copying from this buffer, and I don't know why it's not getting a proper ID during creation, but I do have a workaround for the time being.

Jun 10 '24 17:06 dfellis

@dfellis: Just the context I'm aware of: We're in the middle of transitioning backend resources to being tracked only by Arc, rather than ID. There are, unfortunately, some places where we are still tracking by ID. When code that only keeps track of Arcs attempts to use APIs that use IDs, then the code has no choice but to panic, since we're definitely doing something we Shouldn't Do™.
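A minimal sketch of the failure mode described above, assuming a simplified model rather than wgpu-core's actual types: `ResourceInfo`, `try_id`, and `trace_destroy` here are hypothetical stand-ins. It shows why an ID accessor that unwraps panics for a resource tracked only by `Arc`, and how a fallible accessor lets the caller handle the missing ID.

```rust
use std::sync::Arc;

// Hypothetical stand-in for per-resource bookkeeping; not wgpu-core's real type.
struct ResourceInfo {
    // `None` models a resource that is tracked only by `Arc` and never got an ID.
    id: Option<u32>,
    label: &'static str,
}

impl ResourceInfo {
    // Mirrors the failing pattern from the backtrace: panics when there is no ID.
    fn id(&self) -> u32 {
        self.id.unwrap()
    }

    // Fallible alternative: the caller decides what to do without an ID.
    fn try_id(&self) -> Option<u32> {
        self.id
    }
}

// Hypothetical trace hook that records a buffer being destroyed.
fn trace_destroy(info: &ResourceInfo) -> String {
    match info.try_id() {
        Some(id) => format!("DestroyBuffer(Id({id}))"),
        None => format!("DestroyBuffer(<no id>, label: {:?})", info.label),
    }
}

fn main() {
    let user_buffer = Arc::new(ResourceInfo { id: Some(1), label: "user buffer" });
    let internal = Arc::new(ResourceInfo {
        id: None,
        label: "(wgpu internal) initializing unmappable buffer",
    });

    // Safe: this resource has an ID.
    assert_eq!(user_buffer.id(), 1);

    println!("{}", trace_destroy(&user_buffer));
    println!("{}", trace_destroy(&internal));

    // Calling `internal.id()` here would panic with
    // "called `Option::unwrap()` on a `None` value".
}
```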

I believe that the solution here is to progress in our migration of resource tracking code that uses Arcs instead of IDs.

CC @teoxoy, @jimblandy.

Jun 10 '24 17:06 ErichDonGubler

@ErichDonGubler understood. Do you know what the timeline is on that conversion?

I've realized that my hack to work around this won't cut it because it fails for the OpenGL backend since MAP_WRITE is only allowed to be paired with COPY_SRC. That it's even working at all on the Vulkan backend is probably itself a bug?
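The pairing rule mentioned above can be sketched as a small bitmask check. This is an illustration under assumptions, not wgpu's validation code: the constants are illustrative bit values standing in for `wgpu::BufferUsages` flags, and `map_write_usage_is_portable` is a hypothetical helper. The device dump earlier in the thread lists MAPPABLE_PRIMARY_BUFFERS among the enabled features, which may be why the Vulkan device accepted the combination, but that is a guess.

```rust
// Illustrative bit values; the real flags live in `wgpu::BufferUsages`.
const MAP_WRITE: u32 = 1 << 1;
const COPY_SRC: u32 = 1 << 2;
const COPY_DST: u32 = 1 << 3;
const STORAGE: u32 = 1 << 7;

/// Returns true when `usage` either omits MAP_WRITE or pairs it with
/// nothing but COPY_SRC (hypothetical helper, not a wgpu function).
fn map_write_usage_is_portable(usage: u32) -> bool {
    let has_map_write = (usage & MAP_WRITE) != 0;
    let extra_bits = usage & !(MAP_WRITE | COPY_SRC);
    !has_map_write || extra_bits == 0
}

fn main() {
    // The buffer from the report, before the workaround.
    let original = COPY_SRC | COPY_DST | STORAGE;
    // The workaround: the same buffer with MAP_WRITE added.
    let workaround = original | MAP_WRITE;
    // The combination seen on the internal staging buffer.
    let staging = MAP_WRITE | COPY_SRC;

    println!("original portable:   {}", map_write_usage_is_portable(original));
    println!("workaround portable: {}", map_write_usage_is_portable(workaround));
    println!("staging portable:    {}", map_write_usage_is_portable(staging));
}
```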

And with that, I probably have to hold off on upgrading until copy_buffer_to_buffer works without this failure, or I find a cross-backend workaround and leave a big TODO to try and move back to the normal API.

Jun 10 '24 18:06 dfellis

@dfellis: We don't currently have one, but if this conversion is blocking or regressing user code, there's a good chance we can justify prioritizing it!

I'll let others comment on further context here, since I don't have it. 😅

Jun 10 '24 19:06 ErichDonGubler

@ErichDonGubler got it! But in the meantime, I have finally realized what's actually causing the crash in copy_buffer_to_buffer and it's the tracing itself.

I turned on tracing when I couldn't get my code working on the RISC-V single board computer I bought to specifically try and catch bugs in my code from platform assumptions, and then started getting errors. (Hooray, purchase justified ;) )

In the meantime I figured out that the issue was the Vulkan driver on this SBC doesn't implement everything needed for wgpu so I added logic to scan all of the adapters and pick the first one that has true for is_webgpu_compliant, but I did that on a new branch off of my main, which had wgpu on 0.19.4 without tracing on, while the branch I was debugging on is 0.20.0 with tracing turned on.
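The adapter scan described above can be sketched roughly as follows. The `AdapterSummary` type and `pick_first_compliant` function are hypothetical and stand in for real adapters so the example runs without a GPU; with wgpu itself, the flag would presumably come from something like `adapter.get_downlevel_capabilities().is_webgpu_compliant()` on each adapter returned by `Instance::enumerate_adapters`.

```rust
// Hypothetical summary of one adapter; not a wgpu type.
#[derive(Debug, Clone, PartialEq)]
struct AdapterSummary {
    name: &'static str,
    backend: &'static str,
    // Assumed to be filled in from the adapter's downlevel capabilities.
    webgpu_compliant: bool,
}

// Picks the first adapter that reports itself as WebGPU compliant, if any.
fn pick_first_compliant(adapters: &[AdapterSummary]) -> Option<&AdapterSummary> {
    adapters.iter().find(|a| a.webgpu_compliant)
}

fn main() {
    // Made-up adapter list resembling the situation on the SBC.
    let adapters = [
        AdapterSummary { name: "broken-vulkan", backend: "Vulkan", webgpu_compliant: false },
        AdapterSummary { name: "working-gl", backend: "Gl", webgpu_compliant: true },
    ];

    match pick_first_compliant(&adapters) {
        Some(a) => println!("using {} ({})", a.name, a.backend),
        None => eprintln!("no WebGPU-compliant adapter found"),
    }
}
```

Returning an `Option` keeps the "no compliant adapter" case explicit, so the caller can fall back or report an error instead of silently using a non-compliant driver.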

With the apparent fix for 0.20.0 being to slap MAP_WRITE onto a buffer that it shouldn't be on, I started prepping that for actual merging by turning off tracing, and tests continued to pass on my x86-64 machines. Then I tried to run it on the RISC-V machine and got a validation error saying that I'm configuring the buffer incorrectly.

Okay, I agree, so let's try to figure out how to replicate whatever copy_buffer_to_buffer is doing internally with a temporary MAP_WRITE buffer. I created some extra temporary buffers and tried to insert them into the command queue, getting more errors that I was doing things incorrectly when trying to write into the MAP_WRITE buffer so I could then use it to write out to another buffer. Then I just reverted all of the changes in that file and re-ran the failing test so I could get the stacktrace on my machine, and it just worked.

Tested it on the RISC-V SBC and it also worked there: the difference is just removing features = ["trace"] in the Cargo.toml file.

So now I would say my real bug report is that the trace feature is broken by this migration to Arc, because it looks like the trace output requires IDs? (See snippet from the trace below) And this breakage in trace then produces a super misleading rabbit hole to spend a couple of days on.

Submit(2, [
    CopyBufferToBuffer(
        src: Id(0, 1, Vulkan),
        src_offset: 0,
        dst: Id(1, 1, Vulkan),
        dst_offset: 0,
        size: 16,
    ),
]),

Jun 10 '24 19:06 dfellis

I think this should have been fixed by f2ea30772c5a7c6777aee0511dd9b7198eb61329 (https://github.com/gfx-rs/wgpu/pull/5871) (not part of any release yet). @dfellis could you confirm?

Jul 02 '24 10:07 teoxoy

So unfortunately I am not 100% sure that things are fixed.

Here is a screenshot of me re-branching off of the commit where I was trying to move to 0.20.0 at that time:

Screenshot from 2024-07-02 13-38-59

(I ran the test executable a second time outside of cargo test so we can see the failure in src/resource.rs:121)

Screenshot from 2024-07-02 14-09-02

When I run the test on the commit you pointed at, I get a bare segfault and trace.ron file ends with the same value as I originally reported.

Screenshot from 2024-07-02 14-24-39

So that looks like it's definitely still broken. But when I run the same commit with tracing on but also with the is_webgpu_compliant check, it succeeds.

Screenshot from 2024-07-02 14-16-47

Screenshot from 2024-07-02 14-31-45

So, I think tracing is working, but I'm not entirely sure. When I reproduce the exact same path as before, where I use the buggy Vulkan driver, trace.ron still has the same output as before, but this time with a segfault instead of an unwrap exploding. So perhaps it's just coincidental that the buggy Vulkan drivers are blowing up just after the point where the tracing blew up before, or perhaps the tracing is just blowing up in a new and more exciting way?

I'm not really sure how to make that distinction, unfortunately.

Jul 02 '24 19:07 dfellis

I think the initial issue was resolved; with f2ea30772c5a7c6777aee0511dd9b7198eb61329 there will be no more ID unwrapping.

> so perhaps it's just coincidental that the buggy Vulkan drivers are blowing up just after the tracing was before, or perhaps the tracing is just blowing up in a new and more exciting way?

Since we now trace as soon as possible, the segfault is probably in:

https://github.com/gfx-rs/wgpu/blob/f2ea30772c5a7c6777aee0511dd9b7198eb61329/wgpu-core/src/resource.rs#L478

Could you debug the segfault to see what's causing it?

Jul 03 '24 06:07 teoxoy

This has been fixed by f2ea30772c5a7c6777aee0511dd9b7198eb61329 (https://github.com/gfx-rs/wgpu/pull/5871). Please open a new issue for the segfault if it turns out to be due to an issue in our implementation.

Jul 03 '24 16:07 teoxoy

Hey, took a while to get back on this because my nvme drive on the machine died mid-debugging and I had to debug that first.

Anyways, it looks like it is blowing up inside of the VK driver so feel free to keep this closed.

Screenshot from 2024-07-03 13-57-54

Jul 03 '24 18:07 dfellis

No worries! Are you having issues with that GPU/driver in other apps as well? We could still be at fault for segfaults inside drivers if we use the API improperly.

Jul 03 '24 20:07 teoxoy

> No worries! Are you having issues with that GPU/driver in other apps as well? We could still be at fault for segfaults inside drivers if we use the API improperly.

So how I resolved my issues on this test machine last month was to iterate through all of the adapters and filter out any adapter where is_webgpu_compliant returns false. I just didn't expect the listing of adapters to include non-compliant drivers by default, so I don't think the crash when using a non-compliant driver is "your fault". But I might want the adapters list to pre-filter by default, so that you have to manually opt in to the non-compliant drivers, where the developer really has to know what they're doing and what wgpu is doing under the hood to use them safely.

The GPU works fine with the OpenGL drivers for my use case, and after digging into things, I don't think any software on the machine uses the Vulkan drivers. (The GUI is GNOME, it's running Wayland with the WM being Mutter, and Mutter uses OpenGL, not Vulkan. It's a RISC-V machine so I can't pull up Steam and try to run some games on it to test on that front via Proton.)

Here's the about and screenfetch for the machine. Probably doesn't bring anything to the table, but just in case:

Screenshot from 2024-07-03 16-04-56

Screenshot from 2024-07-03 16-05-01

After I post this, I'll try running these vulkan test applications that I just found (afterwards just in case they completely crash this machine) and I'll let you know the results.

Jul 03 '24 21:07 dfellis

Hmm... Never mind on that. The instructions for building the example applications don't work because the applications all require an add_shader_library custom cmake function that isn't defined, and after digging in a bit, it seems that's part of their Android testing and building for a "normal" Linux is not really working anymore. I'll see if I can find anything else to test Vulkan with.

Jul 03 '24 21:07 dfellis

I just installed some demo Vulkan apps and they're all failing because the driver doesn't have the "VK_KHR_swapchain" extension.

Screenshot from 2024-07-03 16-43-49

The only thing I can find online about that being missing is that it needs to be enabled at device instantiation time, and I presume these example applications would "know" to do that?

So my suspicion that this device's Vulkan driver is simply broken seems more likely. It's running a weird fork of Debian provided by the SBC manufacturer, so it's likely an issue with the drivers they got from PowerVR or how they packaged them, so if I want to pursue this further, I should reach out to them, but as I said I am fine with the OpenGL backend for my needs.

Jul 03 '24 21:07 dfellis

> So how I resolved my issues on this test machine last month was to iterate through all of the adapters and filter out any adapter where is_webgpu_compliant returns false. I just didn't expect the listing of adapters to include non-compliant drivers by default, so I don't think the crash when using a non-compliant driver is "your fault" but I might want the adapters list to pre-filter by default and you have to manually opt-in for the non-compliant drivers where the developer really has to know what they're doing and know what wgpu is doing under the hood to use it safely.

We do filter non-compliant Vulkan drivers out by default, but not non-compliant WebGPU drivers, since we have DownlevelFlags, which tell you what functionality is missing that makes the device not WebGPU compliant. This is so that we have a wider reach. Maybe we should reconsider this being the default, but it's not something users have requested yet AFAIK.

> The only thing I can find online about that being missing is that it needs to be enabled at device instantiation time and I presume these example applications would "know" to do that?

They should; we enable it, for example.

> So my suspicion that this device's Vulkan driver is simply broken seems more likely. It's running a weird fork of Debian provided by the SBC manufacturer, so it's likely an issue with the drivers they got from PowerVR or how they packaged them, so if I want to pursue this further, I should reach out to them, but as I said I am fine with the OpenGL backend for my needs.

It does seem like something is misconfigured.

Jul 04 '24 07:07 teoxoy