writing to storage buffers/textures in fragment shaders with multisample targets is not portable.
IIUC, the spec says or implies that if you're rendering to a non-multisampled target you get 1 invocation of the fragment shader for each pixel being rendered.
If rendering to a multisampled target then you still only get 1 invocation per pixel, unless you're using sample_index or @interpolate(..., sample), in which case you'll get 1 invocation per sample.
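Concretely, these are the two mechanisms that, as I understand the spec, should switch a fragment shader to per-sample evaluation (a sketch; the function names and returned values are just illustrative):

```wgsl
// 1. Using the sample_index builtin:
@fragment fn fs1(@builtin(sample_index) sampleIndex: u32) -> @location(0) vec4f {
  return vec4f(f32(sampleIndex) * 0.25);
}

// 2. Using sample interpolation on a user-defined varying:
@fragment fn fs2(@location(0) @interpolate(perspective, sample) v: vec4f) -> @location(0) vec4f {
  return v;
}
```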
But testing shows that's not the case. We can count how many times a fragment shader is invoked with something like:
```wgsl
@group(0) @binding(0) var<storage, read_write> count: atomic<u32>;

@fragment fn fs() -> @location(0) vec4f {
  atomicAdd(&count, 1);
  return vec4f(1); // any color; only the count matters
}
```
So, testing that
M1 Mac
PASS: sampleCount: 1, @interpolate(perspective) , , count = 1, should be 1
PASS: sampleCount: 1, @interpolate(perspective) , sample_index, count = 1, should be 1
PASS: sampleCount: 1, @interpolate(perspective, sample), , count = 1, should be 1
PASS: sampleCount: 1, @interpolate(perspective, sample), sample_index, count = 1, should be 1
FAIL: sampleCount: 4, @interpolate(perspective) , , count = 4, should be 1 <---
PASS: sampleCount: 4, @interpolate(perspective) , sample_index, count = 4, should be 4
PASS: sampleCount: 4, @interpolate(perspective, sample), , count = 4, should be 4
PASS: sampleCount: 4, @interpolate(perspective, sample), sample_index, count = 4, should be 4
AMD/Intel Mac
PASS: sampleCount: 1, @interpolate(perspective) , , count = 1, should be 1
PASS: sampleCount: 1, @interpolate(perspective) , sample_index, count = 1, should be 1
PASS: sampleCount: 1, @interpolate(perspective, sample), , count = 1, should be 1
PASS: sampleCount: 1, @interpolate(perspective, sample), sample_index, count = 1, should be 1
PASS: sampleCount: 4, @interpolate(perspective) , , count = 1, should be 1
FAIL: sampleCount: 4, @interpolate(perspective) , sample_index, count = 1, should be 4 <--
FAIL: sampleCount: 4, @interpolate(perspective, sample), , count = 1, should be 4 <--
FAIL: sampleCount: 4, @interpolate(perspective, sample), sample_index, count = 1, should be 4 <--
Windows NVidia 2070 Super
PASS: sampleCount: 1, @interpolate(perspective) , , count = 1, should be 1
PASS: sampleCount: 1, @interpolate(perspective) , sample_index, count = 1, should be 1
PASS: sampleCount: 1, @interpolate(perspective, sample), , count = 1, should be 1
PASS: sampleCount: 1, @interpolate(perspective, sample), sample_index, count = 1, should be 1
PASS: sampleCount: 4, @interpolate(perspective) , , count = 1, should be 1
FAIL: sampleCount: 4, @interpolate(perspective) , sample_index, count = 1, should be 4 <--
FAIL: sampleCount: 4, @interpolate(perspective, sample), , count = 1, should be 4 <--
FAIL: sampleCount: 4, @interpolate(perspective, sample), sample_index, count = 1, should be 4 <--
Does this need to be called out in the spec in some way? One idea is to mention that it's not portable to use storage buffers/textures in a fragment shader because the number of invocations is undefined. Another idea would be to generate a validation error if storage buffers/textures are written to in a fragment shader with a multisample target.
It's also possible these are bugs in my code, or in Dawn, or in my understanding 😅
I am not surprised by this, but I am sad that it's not more consistent.
I'm curious how "count = 1, should be 4" ended up happening, did the compiler determine that sample interpolation and sample_index didn't affect the outcome of the shader? Or did it give wrong results for both?
Actually using sample_index (not just declaring it) cleaned up NVidia Windows:
PASS: sampleCount: 1, @interpolate(perspective) , , count = 1, expected 1
PASS: sampleCount: 1, @interpolate(perspective) , sample_index(not-used), count = 1, expected 1
PASS: sampleCount: 1, @interpolate(perspective) , sample_index( used ), count = 1, expected 1
PASS: sampleCount: 1, @interpolate(perspective, sample), , count = 1, expected 1
PASS: sampleCount: 1, @interpolate(perspective, sample), sample_index(not-used), count = 1, expected 1
PASS: sampleCount: 1, @interpolate(perspective, sample), sample_index( used ), count = 1, expected 1
PASS: sampleCount: 4, @interpolate(perspective) , , count = 1, expected 1
FAIL: sampleCount: 4, @interpolate(perspective) , sample_index(not-used), count = 1, expected 4 <---
PASS: sampleCount: 4, @interpolate(perspective) , sample_index( used ), count = 4, expected 4
PASS: sampleCount: 4, @interpolate(perspective, sample), , count = 4, expected 4
PASS: sampleCount: 4, @interpolate(perspective, sample), sample_index(not-used), count = 4, expected 4
PASS: sampleCount: 4, @interpolate(perspective, sample), sample_index( used ), count = 4, expected 4
M1 is still
PASS: sampleCount: 1, @interpolate(perspective) , , count = 1, expected 1
PASS: sampleCount: 1, @interpolate(perspective) , sample_index(not-used), count = 1, expected 1
PASS: sampleCount: 1, @interpolate(perspective) , sample_index( used ), count = 1, expected 1
PASS: sampleCount: 1, @interpolate(perspective, sample), , count = 1, expected 1
PASS: sampleCount: 1, @interpolate(perspective, sample), sample_index(not-used), count = 1, expected 1
PASS: sampleCount: 1, @interpolate(perspective, sample), sample_index( used ), count = 1, expected 1
FAIL: sampleCount: 4, @interpolate(perspective) , , count = 4, expected 1 <---
PASS: sampleCount: 4, @interpolate(perspective) , sample_index(not-used), count = 4, expected 4
PASS: sampleCount: 4, @interpolate(perspective) , sample_index( used ), count = 4, expected 4
PASS: sampleCount: 4, @interpolate(perspective, sample), , count = 4, expected 4
PASS: sampleCount: 4, @interpolate(perspective, sample), sample_index(not-used), count = 4, expected 4
PASS: sampleCount: 4, @interpolate(perspective, sample), sample_index( used ), count = 4, expected 4
which shows they're still different. Tried Intel and NVidia Macs; both are the same as NVidia Windows. I'm guessing AMD/Intel Mac will match, so maybe this is just an M1 bug?
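For reference, the way I forced sample_index to count as "used" was a phony assignment (I also tried assigning it to the fragment shader's output); roughly:

```wgsl
@group(0) @binding(0) var<storage, read_write> count: atomic<u32>;

@fragment fn fs(@builtin(sample_index) sampleIndex: u32) -> @location(0) vec4f {
  _ = sampleIndex; // phony assignment so the compiler can't treat the builtin as unused
  atomicAdd(&count, 1);
  return vec4f(1);
}
```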
GPU Web WG 2024-06-26 Atlantic-time
- CW: The spec doesn’t define it, and implementations are not consistent with their sample rate. You might end up with 4 or 1 for a given texel/pixel.
- GT: I’m just pointing it out. I don’t know if we want to point it out in the spec, or somehow disallow storage textures when multisampling is on? Or just add a note saying “don’t use this”
- KN: If we were able to force the sample index to be used (so the compiler wouldn’t omit it) then we would get consistency for M1
- GW: Yes
- KN: I’m surprised that M1 … It seems like you’re writing to a storage buffer, so we better do it once-per-sample… I’m surprised they would choose to do 4x as much work as they need. It could be a bug?
- KG: In some ways, that’s what I expect. If you think about it the way it’s written, the as-if behavior is that it always runs 4 times. When it always runs once, that’s an optimization: “I can coalesce”. In this case, Apple is the only one doing it correctly and saying “I cannot coalesce these”
- KG: None of the other APIs attempt to hide that.
- GT: If you’re using a storage buffer or storage texture, then the backends can inject the use of sample_index
- KG: as a work around
- CW: We could say “if you have side effects, you get sample-rate shading”
- KG: Well you’d say “you get sample rate shading anyway, and it’s not observable if you don’t get it”
- CW: Sample rate shading is not the behavior as-if, because a single computed sample
- MM: Sample rate shading spawns more threads. Not about writing to more spots in the resource
- KG: I’m worried we’re confusing a colloquial use of “sample rate shading” with the term-of-art
- CW: We could say in the spec “if you have side-effects, there is 1 invocation running per sample covered”, and it’s consistent across all backends and implementations.
- KG: Yes. That matches what I’m hoping for
- CW: We’ll ask MikeW too. How about you, GT?
- GT: Yep
- KN: How difficult is it to force a use of sample_index? It often gets optimized out. How did you force it to be used, GT?
- GT: I assigned it to _. Also I assigned it to the output of the fragment shader
- KG: If we do it today and land the tests, and things continue to be like WebGL where implementations use our CTS as their criteria for release, then we’re fine. This is the class of workaround you usually want to have
- KN: We could go the other direction, and push implementations to do the other thing consistently. We wouldn’t force sampleIndex if you have side effects, and instead we would force only a single thread to be run if there are side-effects. However, if your shader uses sample index, we’ll still have to force usage of it, because some drivers will optimize it out and affect semantics, which is a no-go.
- CW: There are always constants injected by the implementation which we know to be (or not be) 0, but the compiler doesn’t
- KG: It sounds like there’s a strong desire to force multiple threads to run on all backends.
I am checking with Metal; I created a minimal repro for them based on the website example above.
For NVidia, we talked about having the backend use sample_index to try to force similar behavior. I wonder if it would be possible to do the opposite: if the WGSL compiler can see that sample_index is not actually used, then remove it from the output shader.
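i.e. stripping a declared-but-unread builtin before handing the shader to the backend, something like the following (a sketch; the `count` binding is the same as in the test shader above):

```wgsl
@group(0) @binding(0) var<storage, read_write> count: atomic<u32>;

// As authored: sample_index is declared but never read.
@fragment fn fs(@builtin(sample_index) sampleIndex: u32) -> @location(0) vec4f {
  atomicAdd(&count, 1);
  return vec4f(1);
}

// What the WGSL compiler could emit instead, so the driver sees no
// per-sample input and (hopefully) runs once per pixel:
@fragment fn fs_stripped() -> @location(0) vec4f {
  atomicAdd(&count, 1);
  return vec4f(1);
}
```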
@greggman if we want this to be portable on macOS 15.0 and earlier then we would need to force sample-index for MSAA targets to get consistent behavior.
It does seem to be a driver issue on Apple devices and unexpected behavior. I am still discussing with Metal to see if we can work around it, but a workaround is not obvious to me currently.
If it's a driver bug (that will get fixed eventually) then we don't need to bake the wart into WebGPU even if there's no workaround. Especially on Apple platforms we can expect buggy systems to age out over time.