wgpu
Compute shader crash when doing sequential read and write from/to STORAGE buffer larger than 8MB
Description
In my compute shader, I do a read from a STORAGE buffer, followed by an operation, followed by a write back to that buffer.
var a: u32 = lots_of_data[i];
result = max(result, a);
lots_of_data[i] = result;
On Linux with an Nvidia card using the Vulkan backend, this causes a "parent device is lost" error to be thrown, indicating the GPU has crashed. On macOS using Metal, the shader simply returns no data and the WindowServer process uses 100% GPU until I reboot my machine.
Repro steps
Check out this commit and run cargo run --example hello-compute.
https://github.com/georgemorgan/wgpu/commit/66391306790c3ade21d49cb2d944965755f8e094
Expected vs observed behavior
Expected behavior is that the compute shader returns 60000 u32 values, each equal to 123. Observed behavior is that it returns the initial data in the buffer (1-59999), indicating that no work was done, and the GPU crashes.
Comment out the line lots_of_data[i] = result; in the shader and run it again. The GPU will not crash, and will return the expected 60k element array of 123.
Platform
Adapter 0:
Backend: Metal
Name: "Apple M1 Pro"
VendorID: 0
DeviceID: 0
Type: IntegratedGpu
Compliant: true
Features:
DEPTH_CLIP_CONTROL: true
TEXTURE_COMPRESSION_BC: true
INDIRECT_FIRST_INSTANCE: true
TIMESTAMP_QUERY: false
PIPELINE_STATISTICS_QUERY: false
MAPPABLE_PRIMARY_BUFFERS: true
TEXTURE_BINDING_ARRAY: true
BUFFER_BINDING_ARRAY: false
STORAGE_RESOURCE_BINDING_ARRAY: true
SAMPLED_TEXTURE_AND_STORAGE_BUFFER_ARRAY_NON_UNIFORM_INDEXING: true
UNIFORM_BUFFER_AND_STORAGE_TEXTURE_ARRAY_NON_UNIFORM_INDEXING: true
PARTIALLY_BOUND_BINDING_ARRAY: false
UNSIZED_BINDING_ARRAY: false
MULTI_DRAW_INDIRECT: false
MULTI_DRAW_INDIRECT_COUNT: false
PUSH_CONSTANTS: true
ADDRESS_MODE_CLAMP_TO_BORDER: true
POLYGON_MODE_LINE: true
POLYGON_MODE_POINT: false
TEXTURE_COMPRESSION_ETC2: true
TEXTURE_COMPRESSION_ASTC_LDR: true
TEXTURE_ADAPTER_SPECIFIC_FORMAT_FEATURES: true
SHADER_FLOAT64: false
VERTEX_ATTRIBUTE_64BIT: false
CONSERVATIVE_RASTERIZATION: false
VERTEX_WRITABLE_STORAGE: true
CLEAR_TEXTURE: true
SPIRV_SHADER_PASSTHROUGH: false
SHADER_PRIMITIVE_INDEX: false
MULTIVIEW: false
TEXTURE_FORMAT_16BIT_NORM: true
ADDRESS_MODE_CLAMP_TO_ZERO: true
TEXTURE_COMPRESSION_ASTC_HDR: true
Limits:
Max Texture Dimension 1d: 16384
Max Texture Dimension 2d: 16384
Max Texture Dimension 3d: 2048
Max Texture Array Layers: 2048
Max Bind Groups: 8
Max Dynamic Uniform Buffers Per Pipeline Layout: 8
Max Dynamic Storage Buffers Per Pipeline Layout: 4
Max Sampled Textures Per Shader Stage: 16
Max Samplers Per Shader Stage: 1024
Max Storage Buffers Per Shader Stage: 8
Max Storage Textures Per Shader Stage: 8
Max Uniform Buffers Per Shader Stage: 12
Max Uniform Buffer Binding Size: 4294967295
Max Storage Buffer Binding Size: 4294967295
Max Vertex Buffers: 8
Max Vertex Attributes: 16
Max Vertex Buffer Array Stride: 2048
Max Push Constant Size: 4096
Min Uniform Buffer Offset Alignment: 256
Min Storage Buffer Offset Alignment: 256
Max Inter-Stage Shader Component: 128
Max Compute Workgroup Storage Size: 65536
Max Compute Invocations Per Workgroup: 1024
Max Compute Workgroup Size X: 256
Max Compute Workgroup Size Y: 256
Max Compute Workgroup Size Z: 64
Max Compute Workgroups Per Dimension: 65535
Downlevel Properties:
Shader Model: Sm5
COMPUTE_SHADERS: true
FRAGMENT_WRITABLE_STORAGE: true
INDIRECT_EXECUTION: true
BASE_VERTEX: true
READ_ONLY_DEPTH_STENCIL: true
NON_POWER_OF_TWO_MIPMAPPED_TEXTURES: true
CUBE_ARRAY_TEXTURES: true
COMPARISON_SAMPLERS: true
INDEPENDENT_BLEND: true
VERTEX_STORAGE: true
ANISOTROPIC_FILTERING: true
FRAGMENT_STORAGE: true
MULTISAMPLED_SHADING: true
DEPTH_TEXTURE_AND_BUFFER_COPIES: true
Just a random guess: have you verified that this doesn't run into the OS timeout given the huge workload (60k workgroups with 8M iterations per workgroup)? Commenting out the write operation probably allows the driver to DCE the loop in the shader.
Hmm, yeah that could totally be the problem. That would explain the visual hitch I get each time I run it. That may be the OS resetting the card. How would I get around that? Run fewer workgroups? I want to ensure the card is at 100% util if I can; I figured the driver / OS would preempt the shader execution to have the card do other work instead of just totally resetting it.
If you just want to run it locally, there is probably a way to manually disable the timeout. In general you can try splitting the work over multiple dispatches, and ideally also reduce the workload done per shader invocation. I guess the 8M loop iterations are the more troublesome part in this case.
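To make the "multiple dispatches" suggestion concrete, here is a rough host-side sketch of how the 60k workgroups could be cut into smaller chunks. The helper name split_dispatches and the chunk size are made up for illustration and are not part of wgpu. The sketch only computes the ranges; actually issuing one dispatch per chunk, and telling the shader where its chunk starts (for example through a small uniform), is assumed and left out.

```rust
/// Hypothetical helper (not a wgpu API): split `total_workgroups` into
/// consecutive `(first_workgroup, count)` chunks of at most
/// `max_per_dispatch` workgroups each.
fn split_dispatches(total_workgroups: u32, max_per_dispatch: u32) -> Vec<(u32, u32)> {
    assert!(max_per_dispatch > 0, "chunk size must be non-zero");
    let mut chunks = Vec::new();
    let mut first = 0u32;
    while first < total_workgroups {
        // The last chunk may be smaller than the others.
        let count = max_per_dispatch.min(total_workgroups - first);
        chunks.push((first, count));
        first += count;
    }
    chunks
}
```

With the numbers from this issue, split_dispatches(60_000, 4_096) gives 15 chunks, the last one covering 2,656 workgroups. Whether a given chunk size stays under the OS timeout depends on the driver and the per-invocation cost, so it would have to be tuned by experiment; and this does nothing about the long loop inside each invocation.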
I ran this on the master branch of wgpu on an M1 Mac, and it crashed on the Vulkan backend too. It runs on the Metal backend, but the output is wrong:
... 59953, 59954, 59955, 59956, 59957, 59958, 59959, 59960, 59961, 59962, 59963, 59964, 59965, 59966, 59967, 59968, 59969, 59970, 59971, 59972, 59973, 59974, 59975, 59976, 59977, 59978, 59979, 59980, 59981, 59982, 59983, 59984, 59985, 59986, 59987, 59988, 59989, 59990, 59991, 59992, 59993, 59994, 59995, 59996, 59997, 59998, 59999]
If I slightly change the shader code from
result = max(result, a);
lots_of_data[i] = result;
to:
lots_of_data[i] = max(result, a);
both backends work fine and output the correct results:
... 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123]