wgpu Compute shader crash when doing sequential read and write from/to STORAGE buffer larger than 8MB

Description

In my compute shader, I do a read from a STORAGE buffer, followed by an operation, followed by a write back to that buffer.

        var a: u32 = lots_of_data[i];
        result = max(result, a);
        lots_of_data[i] = result;

On Linux with an Nvidia card using the Vulkan backend, this causes a "parent device is lost" error to be thrown, indicating that the GPU has crashed. On macOS using Metal, the shader simply returns no data and the WindowServer process uses 100% GPU until I reboot my machine.

Repro steps

Check out this commit and run cargo run --example hello-compute.

https://github.com/georgemorgan/wgpu/commit/66391306790c3ade21d49cb2d944965755f8e094

Expected vs observed behavior

The expected behavior is that the compute shader returns 60000 u32 values, each equal to 123. The observed behavior is that it returns the initial data in the buffer (1-59999), indicating that no work was done, and the GPU crashes.

Comment out the line lots_of_data[i] = result; in the shader and run it again. The GPU will not crash, and it will return the expected 60k-element array of 123s.

Platform

Adapter 0:
	Backend:   Metal
	Name:      "Apple M1 Pro"
	VendorID:  0
	DeviceID:  0
	Type:      IntegratedGpu
	Compliant: true
	Features:
		DEPTH_CLIP_CONTROL:                                             true
		TEXTURE_COMPRESSION_BC:                                         true
		INDIRECT_FIRST_INSTANCE:                                        true
		TIMESTAMP_QUERY:                                                false
		PIPELINE_STATISTICS_QUERY:                                      false
		MAPPABLE_PRIMARY_BUFFERS:                                       true
		TEXTURE_BINDING_ARRAY:                                          true
		BUFFER_BINDING_ARRAY:                                           false
		STORAGE_RESOURCE_BINDING_ARRAY:                                 true
		SAMPLED_TEXTURE_AND_STORAGE_BUFFER_ARRAY_NON_UNIFORM_INDEXING:  true
		UNIFORM_BUFFER_AND_STORAGE_TEXTURE_ARRAY_NON_UNIFORM_INDEXING:  true
		PARTIALLY_BOUND_BINDING_ARRAY:                                  false
		UNSIZED_BINDING_ARRAY:                                          false
		MULTI_DRAW_INDIRECT:                                            false
		MULTI_DRAW_INDIRECT_COUNT:                                      false
		PUSH_CONSTANTS:                                                 true
		ADDRESS_MODE_CLAMP_TO_BORDER:                                   true
		POLYGON_MODE_LINE:                                              true
		POLYGON_MODE_POINT:                                             false
		TEXTURE_COMPRESSION_ETC2:                                       true
		TEXTURE_COMPRESSION_ASTC_LDR:                                   true
		TEXTURE_ADAPTER_SPECIFIC_FORMAT_FEATURES:                       true
		SHADER_FLOAT64:                                                 false
		VERTEX_ATTRIBUTE_64BIT:                                         false
		CONSERVATIVE_RASTERIZATION:                                     false
		VERTEX_WRITABLE_STORAGE:                                        true
		CLEAR_TEXTURE:                                                  true
		SPIRV_SHADER_PASSTHROUGH:                                       false
		SHADER_PRIMITIVE_INDEX:                                         false
		MULTIVIEW:                                                      false
		TEXTURE_FORMAT_16BIT_NORM:                                      true
		ADDRESS_MODE_CLAMP_TO_ZERO:                                     true
		TEXTURE_COMPRESSION_ASTC_HDR:                                   true
	Limits:
		Max Texture Dimension 1d:                        16384
		Max Texture Dimension 2d:                        16384
		Max Texture Dimension 3d:                        2048
		Max Texture Array Layers:                        2048
		Max Bind Groups:                                 8
		Max Dynamic Uniform Buffers Per Pipeline Layout: 8
		Max Dynamic Storage Buffers Per Pipeline Layout: 4
		Max Sampled Textures Per Shader Stage:           16
		Max Samplers Per Shader Stage:                   1024
		Max Storage Buffers Per Shader Stage:            8
		Max Storage Textures Per Shader Stage:           8
		Max Uniform Buffers Per Shader Stage:            12
		Max Uniform Buffer Binding Size:                 4294967295
		Max Storage Buffer Binding Size:                 4294967295
		Max Vertex Buffers:                              8
		Max Vertex Attributes:                           16
		Max Vertex Buffer Array Stride:                  2048
		Max Push Constant Size:                          4096
		Min Uniform Buffer Offset Alignment:             256
		Min Storage Buffer Offset Alignment:             256
		Max Inter-Stage Shader Component:                128
		Max Compute Workgroup Storage Size:              65536
		Max Compute Invocations Per Workgroup:           1024
		Max Compute Workgroup Size X:                    256
		Max Compute Workgroup Size Y:                    256
		Max Compute Workgroup Size Z:                    64
		Max Compute Workgroups Per Dimension:            65535
	Downlevel Properties:
		Shader Model:                        Sm5
		COMPUTE_SHADERS:                     true
		FRAGMENT_WRITABLE_STORAGE:           true
		INDIRECT_EXECUTION:                  true
		BASE_VERTEX:                         true
		READ_ONLY_DEPTH_STENCIL:             true
		NON_POWER_OF_TWO_MIPMAPPED_TEXTURES: true
		CUBE_ARRAY_TEXTURES:                 true
		COMPARISON_SAMPLERS:                 true
		INDEPENDENT_BLEND:                   true
		VERTEX_STORAGE:                      true
		ANISOTROPIC_FILTERING:               true
		FRAGMENT_STORAGE:                    true
		MULTISAMPLED_SHADING:                true
		DEPTH_TEXTURE_AND_BUFFER_COPIES:     true

Mar 23 '22 18:03 georgemorgan

Just a random guess: have you verified that this doesn't run into the OS timeout given the huge workload (60k workgroups with 8M iterations per workgroup)? Commenting out the write operation probably allows the driver to DCE the loop in the shader.

Mar 24 '22 20:03 msiglreith
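
To put the workload mentioned above in numbers, here is a minimal Rust sketch of the arithmetic. The workgroup and iteration counts are the ones quoted in the comment; the throughput figure is purely an assumption for illustration and is not a measurement of any GPU. The function names are hypothetical.

```rust
/// Total loop iterations across all workgroups.
/// u64 is used because 60_000 * 8_000_000 does not fit in a u32.
fn total_iterations(workgroups: u64, iterations_per_workgroup: u64) -> u64 {
    workgroups * iterations_per_workgroup
}

/// Rough run time in seconds at a given sustained rate (iterations per second).
fn estimated_seconds(total: u64, iterations_per_second: f64) -> f64 {
    total as f64 / iterations_per_second
}

fn main() {
    // Figures quoted in the thread: 60k workgroups, 8M iterations each.
    let total = total_iterations(60_000, 8_000_000);

    // ASSUMPTION: 1e10 iterations per second is a made-up rate, chosen only
    // to show how the estimate is formed. Real throughput will differ.
    let secs = estimated_seconds(total, 1e10);

    println!("{} iterations, about {} s at the assumed rate", total, secs);
}
```

Even at an optimistic assumed rate, a single dispatch of this size would run for many seconds, which is consistent with the suspicion that an OS watchdog timeout is being hit.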

Hmm, yeah that could totally be the problem. That would explain the visual hitch I get each time I run it. That may be the OS resetting the card. How would I get around that? Run fewer workgroups? I want to ensure the card is at 100% util if I can; I figured the driver / OS would preempt the shader execution to have the card do other work instead of just totally resetting it.

Mar 24 '22 21:03 georgemorgan

If you just want to run it locally, there is probably a way to manually disable the timeout. In general, you can try splitting the work over multiple dispatches and ideally also splitting the workload done per shader. I guess the 8M loop iterations are the more troublesome part in this case.

Mar 24 '22 21:03 msiglreith
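
The suggestion to split the work over multiple dispatches could be sketched on the host side roughly as follows. This is a plain Rust sketch of the chunking logic only, with no wgpu calls. The helper name split_dispatches and the chunk size of 16_384 are hypothetical, and the shader would need some way to receive the starting workgroup offset (for example a uniform or push constant), which is also an assumption.

```rust
/// Hypothetical helper: split `total_workgroups` into (first_workgroup, count)
/// chunks, each containing at most `max_per_dispatch` workgroups.
fn split_dispatches(total_workgroups: u32, max_per_dispatch: u32) -> Vec<(u32, u32)> {
    assert!(max_per_dispatch > 0, "chunk size must be non-zero");
    let mut chunks = Vec::new();
    let mut offset = 0u32;
    while offset < total_workgroups {
        // The last chunk may be smaller than the others.
        let count = max_per_dispatch.min(total_workgroups - offset);
        chunks.push((offset, count));
        offset += count;
    }
    chunks
}

fn main() {
    // 60_000 workgroups as in the issue; 16_384 is an arbitrary example size.
    for (first, count) in split_dispatches(60_000, 16_384) {
        // In a real program, each chunk would become its own dispatch of
        // `count` workgroups, ideally in its own submission, with `first`
        // passed to the shader so it can compute its global index.
        println!("dispatch {} workgroups starting at workgroup {}", count, first);
    }
}
```

Smaller chunks keep each individual dispatch short, at the cost of more submissions. Reducing the number of loop iterations per invocation would need a matching change in the shader itself.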

I ran this on the master branch of wgpu on an M1 Mac, and it crashed on the vk backend too. It works on the Metal backend, but the output is wrong:

... 59953, 59954, 59955, 59956, 59957, 59958, 59959, 59960, 59961, 59962, 59963, 59964, 59965, 59966, 59967, 59968, 59969, 59970, 59971, 59972, 59973, 59974, 59975, 59976, 59977, 59978, 59979, 59980, 59981, 59982, 59983, 59984, 59985, 59986, 59987, 59988, 59989, 59990, 59991, 59992, 59993, 59994, 59995, 59996, 59997, 59998, 59999]

If I slightly change the shader code from

result = max(result, a);
lots_of_data[i] = result;

to:

lots_of_data[i] = max(result, a);

both backends work fine and output the correct results:

... 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123]

Jun 10 '22 07:06 jinleili
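
One point worth keeping in mind about the rewrite above is that the two forms are not equivalent in general: the original updates result on every iteration, while the rewritten form never does. The CPU-side Rust sketch below illustrates the difference. It assumes result is a local accumulator initialized before a loop over i; the actual shader lives in the linked commit and is not reproduced here, and the function names and sample data are hypothetical.

```rust
/// Original form: `result` is carried across iterations (a running maximum),
/// so each write depends on every earlier read.
fn running_max(data: &mut [u32], init: u32) {
    let mut result = init;
    for x in data.iter_mut() {
        result = result.max(*x);
        *x = result;
    }
}

/// Rewritten form: `result` is never updated, so each element becomes
/// max(init, element) independently of the others.
fn elementwise_max(data: &mut [u32], init: u32) {
    let result = init;
    for x in data.iter_mut() {
        *x = result.max(*x);
    }
}

fn main() {
    let mut a = vec![1u32, 5, 2, 9, 3];
    let mut b = a.clone();

    running_max(&mut a, 4);
    elementwise_max(&mut b, 4);

    println!("running max:     {:?}", a); // [4, 5, 5, 9, 9]
    println!("elementwise max: {:?}", b); // [4, 5, 4, 9, 4]
}
```

The two variants agree only when no element exceeds the initial value of result or when the data is already non-decreasing. Removing the loop-carried dependency may also change how the driver compiles the loop, so the rewrite working is not by itself evidence about the cause of the crash.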
