Use subgroup operations when possible
Subgroups can substantially enhance performance and adaptability for machine learning tasks on GPUs. Since they're now available as an origin trial, https://webllm.mlc.ai/ could take advantage of them.
I'm not sure yet what is needed to make this work... I assume some work in Apache TVM as well.
I highly recommend you check out the quick-start guide at https://developer.chrome.com/blog/new-in-webgpu-128#experimenting_with_subgroups. For info, only `subgroupBallot` and `subgroupBroadcast` are available for now, but more built-in functions such as `subgroupAdd`, `subgroupAll`, `subgroupElect`, and `subgroupShuffle` will be added in the near future.
@CharlieFRuan @tqchen What are your thoughts on this?
This is great, subgroup shuffle can be useful for reduction operations. We already have warp shuffle support in the Metal backend, so maybe we can try adding codegen support for the WebGPU backend.
The following subgroup shuffle functions are actually in Chrome 129 (currently beta):
- `subgroupShuffle(value, id)`: Returns `value` from the active invocation whose `subgroup_invocation_id` matches `id`.
- `subgroupShuffleXor(value, mask)`: Returns `value` from the active invocation whose `subgroup_invocation_id` matches `subgroup_invocation_id ^ mask`. `mask` must be dynamically uniform.
- `subgroupShuffleUp(value, delta)`: Returns `value` from the active invocation whose `subgroup_invocation_id` matches `subgroup_invocation_id - delta`.
- `subgroupShuffleDown(value, delta)`: Returns `value` from the active invocation whose `subgroup_invocation_id` matches `subgroup_invocation_id + delta`.
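To show how these map onto the reduction use case, here is a minimal sketch (a hypothetical example, not WebLLM's actual kernels) of a per-subgroup sum using `subgroupShuffleDown`, assuming the shipped `subgroups` feature and a subgroup size of 32:

```js
// Minimal sketch: per-subgroup sum reduction with subgroupShuffleDown.
// Assumes the "subgroups" feature and a 32-wide subgroup; this is a
// hypothetical example, not one of WebLLM's actual kernels.
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice({ requiredFeatures: ["subgroups"] });

const module = device.createShaderModule({
  code: /* wgsl */ `
    enable subgroups;

    @group(0) @binding(0) var<storage, read> input: array<f32>;
    @group(0) @binding(1) var<storage, read_write> partialSums: array<f32>;

    @compute @workgroup_size(32)
    fn reduce(@builtin(workgroup_id) wid: vec3u,
              @builtin(global_invocation_id) gid: vec3u,
              @builtin(subgroup_invocation_id) sgId: u32) {
      var v = input[gid.x];
      // Shuffle-down tree reduction, assuming a 32-wide subgroup.
      v += subgroupShuffleDown(v, 16u);
      v += subgroupShuffleDown(v, 8u);
      v += subgroupShuffleDown(v, 4u);
      v += subgroupShuffleDown(v, 2u);
      v += subgroupShuffleDown(v, 1u);
      // Invocation 0 now holds the sum of its subgroup's 32 values.
      if (sgId == 0u) {
        partialSums[wid.x] = v;
      }
    }
  `,
});
// Pipeline creation, buffer setup, and dispatch are omitted for brevity.
```

The constant-delta shuffles fold the 32 values down to invocation 0, mirroring the classic CUDA warp-shuffle reduction pattern.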
@tqchen @CharlieFRuan Is this being implemented in Apache TVM?
Hi @beaufortfrancois Really appreciate the info and suggestions! We think it is a good idea to have it implemented in the TVM flow. Unfortunately, we are a bit out of bandwidth as of now. We'll revisit in the future!
According to https://groups.google.com/a/chromium.org/g/blink-dev/c/xteMk_tObgI/m/7wt8sloPDAAJ, Chrome is planning to ship subgroups in Chrome 134 (March 4th, 2025). This would be a great time to support them in WebLLM. What do you think folks?
(gentle ping) @tqchen @CharlieFRuan
Thanks for the info! Quick question: do all devices support subgroup ops? Or is it a device-dependent thing? Ideally, we only want to host a single set of WGSL kernels for each model, so each user downloads the same .wasm, since WebLLM does ahead-of-time compilation.
As you can see in https://source.chromium.org/search?q=%22EnableFeature(Feature::subgroups)%22%20f:%5Ethird_party%2Fdawn%2F&ss=chromium, not all devices support subgroups and you need to take this into account.
const adapter = await navigator.gpu.requestAdapter();
// Bail out if there is no adapter or it doesn't expose the "subgroups" feature.
if (!adapter?.features.has("subgroups")) {
  throw new Error("Subgroups support is not available");
}
// Explicitly request subgroups support.
const device = await adapter.requestDevice({
  requiredFeatures: ["subgroups"],
});
@CharlieFRuan Does the answer above prevent you from using subgroups in WebLLM eventually?
I'll try to support shuffle for reduction operations in TVM's WebGPU backend this week and next. One possibility is that we compile two sets of kernels for each model: one for performant devices and one for fallbacks (the current ones).
That's great to hear! Please share the Apache TVM issue or PR when they're available so that we can keep track of your work and help if needed.
@CharlieFRuan Did you have a chance to start working on it?
Yes, I hope to get a version by the end of this week if everything goes well.
Hi @beaufortfrancois! I was able to get an initial version done in TVM: https://github.com/apache/tvm/pull/17699
The PR description includes what is done and not done, and the dumped compiled kernels. The remaining part is mainly about UI, since not all WebGPU devices support subgroups. For WebLLM, I'd need to compile two sets of kernels: one for devices that support subgroups (and other future performant features, e.g. a high maxComputeInvocationsPerWorkgroup), and one set for fallbacks (the current kernels).
One question I have is, typically what devices have a high maxComputeInvocationsPerWorkgroup? My M3 laptop has 1k, but IIRC it used to be 256 before. Any pointer would be helpful. I am considering what value to set for the more performant set of WebLLM kernels. Another note is that I always use 32 for the subgroup size, same as what TVM does for Metal and CUDA.
So the performant set of WebGPU kernels will require WebGPU devices (as of now):
- maxComputeInvocationsPerWorkgroup = 1k is supported
- subgroups are supported
- subgroup_size = 32 is supported
Edit: the benchmark seems to fluctuate quite a bit; I need more rigorous benchmarking to see the speedup from subgroups (another TODO). From my observation, the E2E decode speedup (in tokens per second) is typically around 1.03x.
This is great news @CharlieFRuan! Thanks for sharing!
FYI, I'm adding the subgroups feature in @webgpu/types at https://github.com/gpuweb/types/pull/167 which should help with https://github.com/apache/tvm/pull/17699/files#diff-cb3572240c47c4c62eaa4cc0e1e0cd15f88ae4c4222de860e9f63b01dc000090R113
> One question I have is, typically what devices have a high maxComputeInvocationsPerWorkgroup? My M3 laptop has 1k, but IIRC it used to be 256 before. Any pointer would be helpful.
My understanding is that Chromium's maxComputeInvocationsPerWorkgroup limit varies across machines, with values of 128, 256, or 1024, based on the device's performance tier. (source)
// Tiers for limits related to workgroup size.
// TODO(crbug.com/dawn/685): Define these better. For now, use two tiers where one
// is available on nearly all desktop platforms.
// compat tier0 tier1
#define LIMITS_WORKGROUP_SIZE(X) \
X(Maximum, maxComputeInvocationsPerWorkgroup, 128, 256, 1024) \
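Note that whatever tier the adapter reports, a device only gets the default limit of 256 unless the page explicitly asks for more. A minimal sketch (hypothetical, not WebLLM code) of opting into the adapter's full limit:

```js
// Hypothetical sketch: request the adapter's full
// maxComputeInvocationsPerWorkgroup, since devices default to 256
// unless a higher value is passed through requiredLimits.
const adapter = await navigator.gpu.requestAdapter();
const maxInvocations = adapter.limits.maxComputeInvocationsPerWorkgroup;
console.log(`Adapter supports ${maxInvocations} invocations per workgroup`);

const device = await adapter.requestDevice({
  requiredLimits: { maxComputeInvocationsPerWorkgroup: maxInvocations },
});
```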
> Another note is that I always use 32 for the subgroup size, same as what TVM does for Metal and CUDA.
I strongly suggest you always check the subgroupMinSize and subgroupMaxSize GPU adapter info values instead of assuming 32, even though it may work for now. See this CL for more about their values in each backend.
I'm looking forward to more of your benchmarks!
The WebGPU maxComputeInvocationsPerWorkgroup limit corresponds to Vulkan's maxComputeWorkGroupInvocations limit.
See Sascha Willems' GPUinfo.org for this data on Vulkan: https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxComputeWorkGroupInvocations&platform=all. It counts distinct GPUs reported and is not weighted by number of users, but it's still pretty comprehensive.
1K is very common.
I would be more concerned about subgroup sizing tiers. My folk knowledge is that 32 is ideal for NVIDIA, 64 is better for AMD, and 16 or 32 is good for Intel (Intel supports 8, 16, and 32, but the compiler tends to choose).
But picking 32 outright is a great start.
Thanks @beaufortfrancois @dneto0 for the insights and pointers, super helpful!
> 1K is very common.
I see, the link is quite insightful. I'll go with 1k for the performant set of WebLLM's WebGPU kernels.
> I would be more concerned about subgroup sizing tiers. My folk knowledge is that 32 is ideal for NVIDIA, 64 is better for AMD, and 16 or 32 is good for Intel (Intel supports 8, 16, and 32, but the compiler tends to choose).
CUDA 32, AMD 64, and Metal 32 are aligned with the values in TVM. Since WebLLM hosts pre-compiled kernels, and I am not sure whether using a dynamic subgroup size value when compiling with TVM is a good idea, I think I'll go with 32 for now (it seems to be the most widely accepted value).
The main concern is that it may create too much complication to host a plethora of pre-compiled WebGPU kernels for WebLLM for different {subgroup_size} x {maxInvocations} x ... combinations. I think hosting two sets of kernels (performant and fallback) is a good starting point. If the device does not support a subgroup size of 32 (according to subgroupMinSize and subgroupMaxSize), it will use the fallback kernels.
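Putting the criteria from this thread together, here is a rough sketch (hypothetical helper name and thresholds, not actual WebLLM code) of how the performant-vs-fallback decision could look on the client:

```js
// Rough sketch (not actual WebLLM code): decide whether a device can run
// the "performant" kernel set or should use the fallback kernels.
async function pickKernelSet() {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) return "fallback";

  const { subgroupMinSize, subgroupMaxSize } = adapter.info;
  const performant =
    adapter.features.has("subgroups") &&
    // Performant kernels assume a subgroup size of 32...
    subgroupMinSize <= 32 && 32 <= subgroupMaxSize &&
    // ...and 1k invocations per workgroup.
    adapter.limits.maxComputeInvocationsPerWorkgroup >= 1024;

  return performant ? "performant" : "fallback";
}
```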
@CharlieFRuan Did you have a chance to make some progress by any chance?
FYI, according to https://github.com/microsoft/onnxruntime/commit/8eb5513be6dade1a91408313c5dd18d2dbeaef90, ONNX Runtime sees a 3x perf increase on Metal with subgroup matrices.
Out of curiosity, any news on that front @CharlieFRuan?
@beaufortfrancois Sorry for the delay... Not much of an update yet, but I do want to get this landed.
@CharlieFRuan Any progress on this since it appeared in https://github.com/mlc-ai/web-llm/issues/707?