bevy icon indicating copy to clipboard operation
bevy copied to clipboard

Meshlet single pass depth downsampling (SPD)

Open JMS55 opened this issue 10 months ago • 3 comments

Objective

  • Using multiple raster passes to generate the depth pyramid is extremely slow
  • Pulling data from the source image is the largest bottleneck, it's important to sample in a cache-aware pattern
  • Barriers and pipeline drain between the raster passes is the second largest bottleneck
  • Each separate RenderPass on the CPU is really expensive

Solution

  • Port FidelityFX SPD to WGSL, replacing meshlet's existing multiple raster passes with a ~~single~~ two compute dispatches. Lack of coherent buffers means we have to do the the last 64x64 tile from mip 7+ in a separate dispatch to ensure the mip 6 writes were flushed :(
  • Workgroup shared memory version only at the moment, as the subgroup operation is blocked by our upgrade to wgpu 0.20 #13186
  • Don't enforce a power-of-2 depth pyramid texture size, simply scaling by 0.5 is fine

JMS55 avatar Apr 17 '24 01:04 JMS55

Which parts do you want me to review? Presumably downsample_depth.wgsl? Anything else?

pcwalton avatar Apr 17 '24 19:04 pcwalton

The downsample shader, yeah. I can't post a link ATM, but you can find the full PR diff by changing the GitHub diff to compare against my meshlet-previous-frame-depth-pyramid branch.

JMS55 avatar Apr 17 '24 20:04 JMS55

Downscale is broken, will need to debug. Might be the barrier issue.

JMS55 avatar Apr 24 '24 01:04 JMS55

I believe that there might be an improvement that can be made here:

fn reduce_load_mip_0(tex: vec2u) -> f32 {
    let uv = (vec2f(tex) + 0.5) / vec2f(textureDimensions(mip_0));
    return reduce_4(textureGather(mip_0, samplr, uv));
}

From what I understand from the spec, textureGather already computes the four component minimum and stores it in the w channel. So there is no need to reduce, just do the following:

fn reduce_load_mip_0(tex: vec2u) -> f32 {
    let uv = (vec2f(tex) + 0.5) / vec2f(textureDimensions(mip_0));
    return textureGather(mip_0, samplr, uv).w;
}

It is very possible that the compiler spots this optimization already.

Frankly I'm not sure how textureGather works exactly, in Vulkan you needed a VK_SAMPLER_REDUCTION_MODE_MIN sampler and corresponding extension to do this.

otoomey avatar Jun 26 '24 21:06 otoomey

textureGather does not return the minimum of the 4 values. The w component is the (u_min, v_min) value of the sample footprint. I.e. given 4 texels (the sample footprint) arranged in a 2x2 quad, the w component is the value at location (u_min, v_min).

Yes, ideally I would be able to use VK_SAMPLER_REDUCTION_MODE_MIN, but unfortunately wgpu does not support it.

JMS55 avatar Jun 27 '24 01:06 JMS55