wgpu icon indicating copy to clipboard operation
wgpu copied to clipboard

Use @builtin(num_workgroups) or @builtin(local_invocation_index) is very slow on M1 MacBook

Open TYPEmber opened this issue 9 months ago • 2 comments

But it is fine on window 10 with Nvidia 2080ti.

When I dispatch a (4000, 1000, 1) workgroups, without use @builtin(num_workgroups) or @builtin(local_invocation_index) will take just 90ms, but if I import these two builtin values, it will take 1800ms on M1 MacBook.

wgpu version: 0.20.0 & latest https://github.com/gfx-rs/wgpu.git

@group(0)
@binding(0)
var<storage, read> data: array<i32>;
@group(0)
@binding(1)
var<storage, read_write> output: array<i32>;
@group(0)
@binding(2)
var<storage, read> kernal_offset: array<i32>;
@group(0)
@binding(3)
var<storage, read> kernel_value: array<i32>;

@compute
@workgroup_size(1)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>,
@builtin(num_workgroups) num_workgroups: vec3<u32>,
@builtin(local_invocation_index) local_invocation_index: u32
) {
    var sum = 0;

    let si = global_id.x + global_id.y * 4000;
    let index = si + (si / 4000) * (4040 - 4000);

    for (var i = 0; i < i32(arrayLength(&kernal_offset)); i ++) {
        sum += data[i32(index) + kernal_offset[i]] * kernel_value[i];
    }

    // When I add this line into the code
    // It becomes very slow.
    // sum += i32(num_workgroups.x);
    // sum += i32(local_invocation_index);
    
    output[si] = sum;
}

TYPEmber avatar May 08 '24 07:05 TYPEmber

This is expected behavior. You are using a workgroup of size (1, 1, 1) this is incredibly small and will destroy performance.

Instead you should use a larger workgroup size (32x32 tends to be good), dispatch it in larger chunks of size (125, 32) for your case. Finally, insert a branch to ensure you are within bounds on the array since this is actually acting over an "area" of 1024. If you're over 1000, you don't want to touch your array.

9291Sam avatar May 08 '24 14:05 9291Sam

Thanks for your reply. You are right, this workgroup_size is not for the best performance, it just an example to show the difference between M1 MacBook Air and Nvidia 2080ti Windows. Use @builtin(num_workgroups) or @builtin(local_invocation_index) won't destroy performance on Nvidia 2080ti Windows. And this is my code actually, but it still has the same difference between these two platforms.

@compute
@workgroup_size(workgroup_len)
fn main(@builtin(workgroup_id) workgroup_id: vec3<u32>,
@builtin(num_workgroups) num_workgroups: vec3<u32>,
@builtin(local_invocation_index) local_invocation_index: u32
) {
    var sum = 0;

    var si = 0u;
    var index = 0u;

    if (local_invocation_index == 0) {
        item_si = workgroup_id.x + workgroup_id.y * num_workgroups.x + workgroup_id.z * num_workgroups.x * num_workgroups.y;
        item_index = item_si + (item_si / 4000) * (4040 - 4000);
    }
    workgroupBarrier();

    index = item_index;
    si = item_si;

    for (var i = local_invocation_index; i < arrayLength(&kernal_offset); i += workgroup_len) {
        sum += data[i32(index) + kernal_offset[i]] * kernel_value[i];
    }

    if (local_invocation_index != workgroup_len - 1) {
        atomicAdd(&item_sum, sum);
    } 
    workgroupBarrier();

    if (local_invocation_index == workgroup_len - 1){
        output[si] = sum + item_sum;
    }
}

In a word, no matter how bad the code is, it should not take additional time by use @builtin(num_workgroups) instead of a constant number.

TYPEmber avatar May 08 '24 17:05 TYPEmber

In a word, no matter how bad the code is, it should not take additional time by use @builtin(num_workgroups) instead of a constant number.

I don't know what we can do about this though. We are doing an almost 1:1 translation from WGSL to MSL and the items in question are builtins. It's up to the driver to dispatch those workgroups and execute the shaders.

teoxoy avatar May 13 '24 15:05 teoxoy

Thank you for your contribution to this project! And I try to use naga to translate wgsl to msl, it does looks fine.

TYPEmber avatar May 17 '24 07:05 TYPEmber