writing to storage buffers/textures in fragment shaders with multisample targets is not portable.
IIUC, the spec says or implies that if you're rendering to a non-multisampled target you get 1 invocation of the fragment shader for each pixel being rendered.
If rendering to a multisampled target then you still only get 1 invocation per pixel, unless you're using sample_index or @interpolate(..., sample), in which case you'll get 1 invocation per sample.
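Concretely, these are the two mechanisms that, as I understand the spec, should switch a fragment shader to per-sample evaluation (a sketch; the function names and returned values are just illustrative):

```wgsl
// 1. Using the sample_index builtin:
@fragment fn fs1(@builtin(sample_index) sampleIndex: u32) -> @location(0) vec4f {
  return vec4f(f32(sampleIndex) * 0.25);
}

// 2. Using sample interpolation on a user-defined varying:
@fragment fn fs2(@location(0) @interpolate(perspective, sample) v: vec4f) -> @location(0) vec4f {
  return v;
}
```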
But testing shows that's not the case. We can count how many times a fragment shader is invoked with something like:
```wgsl
@group(0) @binding(0) var<storage, read_write> count: atomic<u32>;

@fragment fn fs() -> @location(0) vec4f {
  atomicAdd(&count, 1);
  return vec4f(1); // any color; only the count matters
}
```
So, testing that
M1 Mac
PASS: sampleCount: 1, @interpolate(perspective) , , count = 1, should be 1
PASS: sampleCount: 1, @interpolate(perspective) , sample_index, count = 1, should be 1
PASS: sampleCount: 1, @interpolate(perspective, sample), , count = 1, should be 1
PASS: sampleCount: 1, @interpolate(perspective, sample), sample_index, count = 1, should be 1
FAIL: sampleCount: 4, @interpolate(perspective) , , count = 4, should be 1 <---
PASS: sampleCount: 4, @interpolate(perspective) , sample_index, count = 4, should be 4
PASS: sampleCount: 4, @interpolate(perspective, sample), , count = 4, should be 4
PASS: sampleCount: 4, @interpolate(perspective, sample), sample_index, count = 4, should be 4
AMD/Intel Mac
PASS: sampleCount: 1, @interpolate(perspective) , , count = 1, should be 1
PASS: sampleCount: 1, @interpolate(perspective) , sample_index, count = 1, should be 1
PASS: sampleCount: 1, @interpolate(perspective, sample), , count = 1, should be 1
PASS: sampleCount: 1, @interpolate(perspective, sample), sample_index, count = 1, should be 1
PASS: sampleCount: 4, @interpolate(perspective) , , count = 1, should be 1
FAIL: sampleCount: 4, @interpolate(perspective) , sample_index, count = 1, should be 4 <--
FAIL: sampleCount: 4, @interpolate(perspective, sample), , count = 1, should be 4 <--
FAIL: sampleCount: 4, @interpolate(perspective, sample), sample_index, count = 1, should be 4 <--
Windows NVidia 2070 Super
PASS: sampleCount: 1, @interpolate(perspective) , , count = 1, should be 1
PASS: sampleCount: 1, @interpolate(perspective) , sample_index, count = 1, should be 1
PASS: sampleCount: 1, @interpolate(perspective, sample), , count = 1, should be 1
PASS: sampleCount: 1, @interpolate(perspective, sample), sample_index, count = 1, should be 1
PASS: sampleCount: 4, @interpolate(perspective) , , count = 1, should be 1
FAIL: sampleCount: 4, @interpolate(perspective) , sample_index, count = 1, should be 4 <--
FAIL: sampleCount: 4, @interpolate(perspective, sample), , count = 1, should be 4 <--
FAIL: sampleCount: 4, @interpolate(perspective, sample), sample_index, count = 1, should be 4 <--
Does this need to be called out in the spec in some way? One idea is to mention that it's not portable to use storage buffers/textures in a fragment shader because the number of invocations is undefined. Another idea would be to generate a validation error if storage buffers/textures are written to in a fragment shader with a multisample target.
It's also possible these are bugs in my code, or in Dawn, or in my understanding 😅
I am not surprised by this, but I am sad that it's not more consistent.
I'm curious how "count = 1, should be 4" ended up happening, did the compiler determine that sample interpolation and sample_index didn't affect the outcome of the shader? Or did it give wrong results for both?
Actually using sample_index (not just declaring it) cleaned up NVidia Windows:
PASS: sampleCount: 1, @interpolate(perspective) , , count = 1, expected 1
PASS: sampleCount: 1, @interpolate(perspective) , sample_index(not-used), count = 1, expected 1
PASS: sampleCount: 1, @interpolate(perspective) , sample_index( used ), count = 1, expected 1
PASS: sampleCount: 1, @interpolate(perspective, sample), , count = 1, expected 1
PASS: sampleCount: 1, @interpolate(perspective, sample), sample_index(not-used), count = 1, expected 1
PASS: sampleCount: 1, @interpolate(perspective, sample), sample_index( used ), count = 1, expected 1
PASS: sampleCount: 4, @interpolate(perspective) , , count = 1, expected 1
FAIL: sampleCount: 4, @interpolate(perspective) , sample_index(not-used), count = 1, expected 4 <---
PASS: sampleCount: 4, @interpolate(perspective) , sample_index( used ), count = 4, expected 4
PASS: sampleCount: 4, @interpolate(perspective, sample), , count = 4, expected 4
PASS: sampleCount: 4, @interpolate(perspective, sample), sample_index(not-used), count = 4, expected 4
PASS: sampleCount: 4, @interpolate(perspective, sample), sample_index( used ), count = 4, expected 4
M1 is still
PASS: sampleCount: 1, @interpolate(perspective) , , count = 1, expected 1
PASS: sampleCount: 1, @interpolate(perspective) , sample_index(not-used), count = 1, expected 1
PASS: sampleCount: 1, @interpolate(perspective) , sample_index( used ), count = 1, expected 1
PASS: sampleCount: 1, @interpolate(perspective, sample), , count = 1, expected 1
PASS: sampleCount: 1, @interpolate(perspective, sample), sample_index(not-used), count = 1, expected 1
PASS: sampleCount: 1, @interpolate(perspective, sample), sample_index( used ), count = 1, expected 1
FAIL: sampleCount: 4, @interpolate(perspective) , , count = 4, expected 1 <---
PASS: sampleCount: 4, @interpolate(perspective) , sample_index(not-used), count = 4, expected 4
PASS: sampleCount: 4, @interpolate(perspective) , sample_index( used ), count = 4, expected 4
PASS: sampleCount: 4, @interpolate(perspective, sample), , count = 4, expected 4
PASS: sampleCount: 4, @interpolate(perspective, sample), sample_index(not-used), count = 4, expected 4
PASS: sampleCount: 4, @interpolate(perspective, sample), sample_index( used ), count = 4, expected 4
which shows they're still different. Tried Intel and NVidia Macs; both are the same as NVidia Windows. I'm guessing AMD/Intel Mac will match, so maybe this is just an M1 bug?
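For reference, the way I forced sample_index to count as "used" was a phony assignment (I also tried assigning it to the fragment shader's output); roughly:

```wgsl
@group(0) @binding(0) var<storage, read_write> count: atomic<u32>;

@fragment fn fs(@builtin(sample_index) sampleIndex: u32) -> @location(0) vec4f {
  _ = sampleIndex; // phony assignment so the compiler can't treat the builtin as unused
  atomicAdd(&count, 1);
  return vec4f(1);
}
```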
GPU Web WG 2024-06-26 Atlantic-time
- CW: The spec doesn’t define it, and implementations are not consistent with their sample rate. You might end up with 4 or 1 for a given texel/pixel.
- GT: I’m just pointing it out. I don’t know if we want to point it out in the spec, or somehow disallow storage textures when multisampling is on? Or just add a note saying “don’t use this”
- KN: If we were able to force the sample index to be used (so the compiler wouldn’t omit it) then we would get consistency for M1
- GW: Yes
- KN: I’m surprised that M1 … It seems like you’re writing to a storage buffer, so we better do it once-per-sample… I’m surprised they would choose to do 4x as much work as they need. It could be a bug?
- KG: In some ways, that’s what I expect. If you think about it the way it’s written, the as-if behavior is that it always runs 4 times. When it always runs once, that’s an optimization: “I can coalesce”. In this case, Apple is the only one doing it correctly and saying “I cannot coalesce these”
- KG: None of the other APIs attempt to hide that.
- GT: If you’re using a storage buffer or storage texture, then the backends can inject the use of sample_index
- KG: as a work around
- CW: We could say “if you have side effects, you get sample-rate shading”
- KG: Well you’d say “you get sample rate shading anyway, and it’s not observable if you don’t get it”
- CW: Sample rate shading is not the behavior as-if, because a single computed sample
- MM: Sample rate shading spawns more threads. Not about writing to more spots in the resource
- KG: I’m worried we’re confusing a colloquial use of “sample rate shading” with the term-of-art
- CW: We could say in the spec “if you have side-effects, there is 1 invocation running per sample covered”, and it’s consistent across all backends and implementations.
- KG: Yes. That matches what I’m hoping for
- CW: We’ll ask MikeW too. How about you, GT?
- GT: Yep
- KN: How difficult is it to force a use of sample_index? It often gets optimized out. How did you force it to be used, GT?
- GT: I assigned it to _. Also I assigned it to the output of the fragment shader
- KG: If we do it today and land the tests, and things continue to be like WebGL where implementations use our CTS as their criteria for release, then we’re fine. This is the class of workaround you usually want to have
- KN: We could go the other direction, and push implementations to do the other thing consistently. We wouldn’t force sampleIndex if you have side effects, and instead we would force only a single thread to be run if there are side-effects. However, if your shader uses sample index, we’ll still have to force usage of it, because some drivers will optimize it out and affect semantics, which is a no-go.
- CW: There are always constants injected by the implementation which we know to be (or not be) 0, but the compiler doesn’t
- KG: It sounds like there’s a strong desire to force multiple threads to run on all backends.
I am checking with Metal; I created a minimal repro for them based on the website example above.
For NVidia, we talked about having the backend use sample_index to try to force similar behavior. I wonder if it would be possible to do the opposite: if the WGSL compiler can see that sample_index is not actually used, then remove it from the output shader.
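i.e. stripping a declared-but-unread builtin before handing the shader to the backend, something like the following (a sketch; the `count` binding is the same as in the test shader above):

```wgsl
@group(0) @binding(0) var<storage, read_write> count: atomic<u32>;

// As authored: sample_index is declared but never read.
@fragment fn fs(@builtin(sample_index) sampleIndex: u32) -> @location(0) vec4f {
  atomicAdd(&count, 1);
  return vec4f(1);
}

// What the WGSL compiler could emit instead, so the driver sees no
// per-sample input and (hopefully) runs once per pixel:
@fragment fn fs_stripped() -> @location(0) vec4f {
  atomicAdd(&count, 1);
  return vec4f(1);
}
```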
@greggman if we want this to be portable on macOS 15.0 and earlier then we would need to force sample-index for MSAA targets to get consistent behavior.
It does seem to be a driver issue on Apple devices and unexpected behavior. I am still discussing with Metal to see if we can work around it, but a workaround is not obvious to me currently.
If it's a driver bug (that will get fixed eventually) then we don't need to bake the wart into WebGPU even if there's no workaround. Especially on Apple platforms we can expect buggy systems to age out over time.