compute-shader-101

Atomic coherency test

Open raphlinus opened this issue 4 years ago • 3 comments

This overwrites the compute-shader-hello example to be a test of atomic coherency. My understanding is that with the barriers it should run with 0 failures, even in strategy 0. With strategy 1 (atomicOr) as a workaround, it seems to be working.

raphlinus avatar Oct 29 '21 00:10 raphlinus

A bit of explanation: this is an attempt to make a clean, isolated test for some atomic coherence failures I've been observing. It's written in WGSL and uses WebGPU, but on the system I've been exploring (AMD 5700XT), I am seeing similar failures in a Vulkan-only context.

One advantage of having the test case be in WGSL is that there is one concept of what the WGSL memory model should be, even if it isn't written down yet. By contrast, in Vulkan-land, the ideas are spread across Vulkan, SPIR-V, and (to a lesser extent) GLSL. Further, there is considerable diversity in how much the memory model is "turned on," in particular whether SPV 1.5 turns on the VulkanMemoryModel capability and whether the corresponding feature is enabled at device creation; the sense I get is that this is fairly rare, and it's far more common to be running in some kind of hybrid compatibility mode.

In any case, it's my feeling that the test in this PR should always pass. It fails on the 5700XT. I find that there are two workarounds, which I think shed some light on the issue.

The first is to use ordinary loads and stores rather than atomics, and to enable the coherent qualifier on the buffer. This is basically what the current piet-gpu code does, as those shaders are written in GLSL with an eye towards compatibility, i.e. not requiring an explicit Vulkan memory model feature. However, this is not consistent with the current approach for translating WGSL. It's possible that the recommended WGSL translation should change to be more like this, though, for better compatibility. One way of thinking about it: there's no way (that I know of) to get the current WGSL translation output starting from GLSL, so making it resemble existing GLSL translation output might reduce the number of exciting and fun bugs (such as this one).

The second is to use atomicOr(…, 0) instead of atomicLoad for the load of the data word. By my reading, the semantics of the two should be identical (w.r.t. both the yet-to-be-written WGSL memory model and the actual Vulkan memory model), but the former conveys more of an intent of "actually go to memory rather than cache". I can see doing that to unblock my work on getting a WebGPU prefix sum implementation out there.

I've done a number of experiments to validate that this is downstream of wgpu's shader translation (naga). One is to use tint to generate spv. Another is to manually add OpMemberDecorate Coherent to the data buffer (which I now understand is not expected to make any difference for purely atomic operations; it affects the handling of ordinary loads and stores). Finally, I've done an experiment of fully enabling the vulkan memory model (on Vk 1.2 device creation), and adding MakeAvailable+Release / MakeVisible+Acquire memory semantics flags to the relevant atomics (this would not be viable from wgsl sources, as only relaxed semantics are currently spec'ed). None of these make a difference.

I'd like to get some form of this into the webgpu test suite. That should hopefully have the effect of shaking out bugs in implementations.

raphlinus avatar Oct 29 '21 04:10 raphlinus

Another fascinating datapoint: if I fully enable the Vulkan memory model and change the load of the data word to gl_SemanticsAcquire | gl_SemanticsMakeVisible, then it passes. I believe that for correctness acquire/release is only needed on the flag load/store, and the data can be relaxed, but intuitively I can see how these would signal "actually go to memory instead of cache".

Also note, I have to hand-edit spv for this to pass validation. If I simply add these flags and use glslangValidator to compile, it produces output that's missing OpCapability VulkanMemoryModelDeviceScopeKHR. If I add that by hand, it passes validation.

raphlinus avatar Oct 29 '21 04:10 raphlinus

I just pushed a new version of the test which uses atomics for both data and flag. The previous version was broken, as there was a race between nonatomic load and store of the data field.

I believe the new version unambiguously demonstrates failures with respect to the Vulkan memory model. Since all memory access is atomic, it doesn't implicate the availability/visibility issues as discussed in https://github.com/gpuweb/gpuweb/issues/2228, and also I believe both the tint and naga translations of this test into SPIR-V are correct.

raphlinus avatar Nov 03 '21 22:11 raphlinus