
Use subgroup operations when possible

beaufortfrancois opened this issue 1 year ago · 23 comments

Subgroups can substantially enhance performance and adaptability for machine learning tasks on GPUs. Since they're now available as an origin trial, https://webllm.mlc.ai/ could take advantage of them.

I'm not sure what is needed yet to make it work... I assume some work in Apache TVM as well.

I highly recommend you check out the quick-start guide at https://developer.chrome.com/blog/new-in-webgpu-128#experimenting_with_subgroups. For info, only subgroupBallot and subgroupBroadcast are available for now, but more built-in functions such as subgroupAdd, subgroupAll, subgroupElect, and subgroupShuffle will be added in the near future.
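For illustration, here is a minimal sketch (not from the guide; buffer names and bindings are illustrative) of a compute shader using subgroupBroadcast. It assumes `device` was created with the "subgroups" feature requested, as described in the quick-start guide:

// Minimal sketch: a WGSL kernel using subgroupBroadcast. Assumes `device`
// has the "subgroups" feature enabled; names are illustrative.
const module = device.createShaderModule({
  code: `
    enable subgroups;

    @group(0) @binding(0) var<storage, read_write> data : array<f32>;

    @compute @workgroup_size(64)
    fn main(@builtin(global_invocation_id) gid : vec3u) {
      // Broadcast the value held by subgroup invocation 0 to every
      // invocation in the same subgroup.
      data[gid.x] = subgroupBroadcast(data[gid.x], 0u);
    }
  `,
});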

beaufortfrancois avatar Aug 20 '24 11:08 beaufortfrancois

@CharlieFRuan @tqchen What are your thoughts on this?

beaufortfrancois avatar Sep 03 '24 09:09 beaufortfrancois

This is great; subgroup shuffle can be useful for reduction operations. We already have warp shuffle support for the Metal backend, so maybe we can try adding similar codegen support to the WebGPU backend.

tqchen avatar Sep 03 '24 14:09 tqchen

The following subgroup shuffle functions are actually in Chrome 129 (currently beta); a small reduction sketch using them follows the list:

  • subgroupShuffle(value, id): Returns value from the active invocation whose subgroup_invocation_id matches id.
  • subgroupShuffleXor(value, mask): Returns value from the active invocation whose subgroup_invocation_id matches subgroup_invocation_id ^ mask. mask must be dynamically uniform.
  • subgroupShuffleUp(value, delta): Returns value from the active invocation whose subgroup_invocation_id matches subgroup_invocation_id - delta.
  • subgroupShuffleDown(value, delta): Returns value from the active invocation whose subgroup_invocation_id matches subgroup_invocation_id + delta.
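
As a hedged illustration (not from the Chrome release notes), a subgroup-level sum reduction could use subgroupShuffleDown like this; the kernel below is a sketch with illustrative buffer names:

// Sketch: tree reduction within a subgroup via subgroupShuffleDown.
const reduceModule = device.createShaderModule({
  code: `
    enable subgroups;

    @group(0) @binding(0) var<storage, read_write> data : array<f32>;

    @compute @workgroup_size(64)
    fn main(@builtin(global_invocation_id) gid : vec3u,
            @builtin(subgroup_invocation_id) sid : u32,
            @builtin(subgroup_size) sg_size : u32) {
      var sum = data[gid.x];
      // Each step pulls the partial sum from the invocation \`offset\` lanes
      // above and adds it in. The offset is the same across the subgroup,
      // so the delta argument stays dynamically uniform.
      for (var offset = sg_size / 2u; offset > 0u; offset /= 2u) {
        sum += subgroupShuffleDown(sum, offset);
      }
      // Only invocation 0 of each subgroup holds a meaningful total here.
      if (sid == 0u) {
        data[gid.x] = sum;
      }
    }
  `,
});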

beaufortfrancois avatar Sep 03 '24 15:09 beaufortfrancois

@tqchen @CharlieFRuan Is this being implemented in Apache TVM?

beaufortfrancois avatar Sep 09 '24 06:09 beaufortfrancois

Hi @beaufortfrancois Really appreciate the info and suggestions! We think it is a good idea to have it implemented in the TVM flow. Unfortunately, we are a bit out of bandwidth as of now. We'll revisit in the future!

CharlieFRuan avatar Sep 10 '24 17:09 CharlieFRuan

According to https://groups.google.com/a/chromium.org/g/blink-dev/c/xteMk_tObgI/m/7wt8sloPDAAJ, Chrome is planning to ship subgroups in Chrome 134 (March 4th, 2025). This would be a great time to support them in WebLLM. What do you think, folks?

beaufortfrancois avatar Jan 13 '25 10:01 beaufortfrancois

(gentle ping) @tqchen @CharlieFRuan

beaufortfrancois avatar Jan 17 '25 09:01 beaufortfrancois

Thanks for the info! Quick question: do all devices support subgroup ops? Or is it a device-dependent thing? Ideally, we only want to host a single set of WGSL kernels for each model, so each user downloads the same .wasm, since WebLLM does ahead-of-time compilation.

CharlieFRuan avatar Jan 21 '25 21:01 CharlieFRuan

As you can see in https://source.chromium.org/search?q=%22EnableFeature(Feature::subgroups)%22%20f:%5Ethird_party%2Fdawn%2F&ss=chromium, not all devices support subgroups, and you need to take this into account, for example:

const adapter = await navigator.gpu.requestAdapter();
// requestAdapter() may resolve to null, and not every adapter exposes the
// "subgroups" feature.
if (!adapter || !adapter.features.has("subgroups")) {
  throw new Error("Subgroups support is not available");
}
// Explicitly request subgroups support.
const device = await adapter.requestDevice({
  requiredFeatures: ["subgroups"],
});

beaufortfrancois avatar Jan 22 '25 07:01 beaufortfrancois

@CharlieFRuan Does the answer above prevent you from using subgroups in WebLLM eventually?

beaufortfrancois avatar Feb 03 '25 07:02 beaufortfrancois

I'll try to support shuffle for reduction operations in TVM's WebGPU backend this week and next. One possibility is that we compile two sets of kernels for each model: one for performant devices, and one for fallbacks (the current ones).

CharlieFRuan avatar Feb 05 '25 16:02 CharlieFRuan

That's great to hear! Please share the Apache TVM issue or PR when they're available so that we can keep track of your work and help if needed.

beaufortfrancois avatar Feb 06 '25 10:02 beaufortfrancois

@CharlieFRuan Did you have a chance to start working on it?

beaufortfrancois avatar Feb 24 '25 10:02 beaufortfrancois

Yes, I hope to get a version by the end of this week if everything goes well.

CharlieFRuan avatar Feb 25 '25 05:02 CharlieFRuan

Hi @beaufortfrancois! I was able to get an initial version done in TVM: https://github.com/apache/tvm/pull/17699

The PR description includes what is done and not done, plus a dump of the compiled kernel. The remaining part is mainly about UI, since not all WebGPU devices support subgroups. For WebLLM, I'd need to compile two sets of kernels: one for devices that support subgroups (and other future performance-oriented features, e.g. a high maxComputeInvocationsPerWorkgroup), and one set for fallbacks (the current kernels).

One question I have is, typically what devices have a high maxComputeInvocationsPerWorkgroup? My M3 laptop has 1k, but IIRC it used to be 256 before. Any pointer would be helpful; I am considering what value to set for the more performant set of WebLLM kernels. Another note is that I always use 32 for the subgroup size, the same as what TVM does for Metal and CUDA.

So the performant set of WebGPU kernels will (as of now) require WebGPU devices that:

  • Support maxComputeInvocationsPerWorkgroup = 1024
  • Support subgroups
  • Support subgroup_size = 32

Edit: the benchmark seems to fluctuate quite a bit; I need more rigorous benchmarking to see the speedup from subgroups (another TODO). The E2E decode speedup (in tokens per second) is typically around 1.03x from my observation.

CharlieFRuan avatar Mar 03 '25 04:03 CharlieFRuan

This is great news @CharlieFRuan! Thanks for sharing!

FYI, I'm adding the subgroups feature in @webgpu/types at https://github.com/gpuweb/types/pull/167 which should help with https://github.com/apache/tvm/pull/17699/files#diff-cb3572240c47c4c62eaa4cc0e1e0cd15f88ae4c4222de860e9f63b01dc000090R113


One question I have is, typically what devices have a high maxComputeInvocationsPerWorkgroup? My M3 laptop has 1k, but IIRC it used to be 256 before. Any pointer would be helpful.

My understanding is that Chromium's maxComputeInvocationsPerWorkgroup limit varies across machines, with values of 128, 256, or 1024, based on the device's performance tier. (source)

// Tiers for limits related to workgroup size.
// TODO(crbug.com/dawn/685): Define these better. For now, use two tiers where one
// is available on nearly all desktop platforms.
//                                                             compat        tier0       tier1
#define LIMITS_WORKGROUP_SIZE(X)                                                                \
    X(Maximum,           maxComputeInvocationsPerWorkgroup,       128,         256,       1024) \

Another note is that I always use 32 for the subgroup size, the same as what TVM does for Metal and CUDA.

I strongly suggest you always check the subgroupMinSize and subgroupMaxSize GPU adapter info instead of assuming 32, even though it may work for now. See this CL for more about their values in each backend.
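
For reference, a minimal sketch of reading those values (property names per the current WebGPU spec):

const adapter = await navigator.gpu.requestAdapter();
const { subgroupMinSize, subgroupMaxSize } = adapter.info;
// Only pick a kernel variant whose subgroup size falls in this range.
console.log(`subgroup sizes: ${subgroupMinSize}..${subgroupMaxSize}`);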


I'm looking forward to more of your benchmarks!

beaufortfrancois avatar Mar 03 '25 14:03 beaufortfrancois

The WebGPU maxComputeInvocationsPerWorkgroup limit corresponds to Vulkan's maxComputeWorkGroupInvocations limit.

See Sascha Willems' GPUinfo.org for this data on Vulkan: https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxComputeWorkGroupInvocations&platform=all That counts distinct GPUs reported and is not weighted by number of users, but it's still pretty comprehensive.

1K is very common.

I would be more concerned about subgroup sizing tiers. My folk knowledge is that 32 is ideal for NVIDIA, 64 is better for AMD, and 16 or 32 is good for Intel (Intel supports 8, 16, and 32, but the compiler tends to choose).

But picking 32 outright is a great start.

dneto0 avatar Mar 03 '25 14:03 dneto0

Thanks @beaufortfrancois @dneto0 for the insights and pointers, super helpful!

1K is very common.

I see, the link is quite insightful. I'll go with 1k for the performant set of WebLLM's WebGPU kernels.

I would be more concerned about subgroup sizing tiers. My folk knowledge is that 32 is ideal for NVIDIA, 64 is better for AMD, and 16 or 32 is good for Intel (Intel supports 8, 16, and 32, but the compiler tends to choose).

CUDA 32, AMD 64, and Metal 32 are aligned with the values in TVM. Since WebLLM hosts pre-compiled kernels, and I am not sure whether using a dynamic subgroup size when compiling with TVM is a good idea, I think I'll go with 32 for now (it seems to be the most widely accepted value).

The main thing is that it may create too much complication to host a plethora of pre-compiled WebGPU kernels for WebLLM across different {subgroup_size} x {maxInvocations} x ... combinations. I think hosting two sets of kernels (performant and fallback) is a good starting point. If the device does not support a subgroup size of 32 (according to subgroupMinSize and subgroupMaxSize), it will use the fallback kernels.
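
A rough sketch of that selection logic (illustrative only, not WebLLM's actual code):

// Pick the "performant" kernel set only when the adapter meets all three
// requirements listed earlier; otherwise fall back to the current kernels.
const adapter = await navigator.gpu.requestAdapter();
const { subgroupMinSize, subgroupMaxSize } = adapter.info;
const usePerformantKernels =
  adapter.features.has("subgroups") &&
  adapter.limits.maxComputeInvocationsPerWorkgroup >= 1024 &&
  subgroupMinSize <= 32 && 32 <= subgroupMaxSize;

const device = await adapter.requestDevice({
  requiredFeatures: usePerformantKernels ? ["subgroups"] : [],
  requiredLimits: usePerformantKernels
    ? { maxComputeInvocationsPerWorkgroup: 1024 }
    : {},
});
// usePerformantKernels would then decide which pre-compiled kernel set
// (.wasm) to fetch.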

CharlieFRuan avatar Mar 03 '25 18:03 CharlieFRuan

@CharlieFRuan Did you have a chance to make some progress?

beaufortfrancois avatar Mar 24 '25 11:03 beaufortfrancois

FYI, according to https://github.com/microsoft/onnxruntime/commit/8eb5513be6dade1a91408313c5dd18d2dbeaef90, ONNX Runtime sees a 3x perf increase on Metal with subgroup matrices.

beaufortfrancois avatar Mar 27 '25 09:03 beaufortfrancois

Out of curiosity, any news on that front @CharlieFRuan?

beaufortfrancois avatar May 05 '25 07:05 beaufortfrancois

@beaufortfrancois Sorry for the delay... Not much of an update yet, but I do want to get this landed.

CharlieFRuan avatar May 05 '25 17:05 CharlieFRuan

@CharlieFRuan Any progress on this since it appeared in https://github.com/mlc-ai/web-llm/issues/707?

beaufortfrancois avatar Oct 09 '25 08:10 beaufortfrancois