webgpu: optimize Gemm and MatMul using subgroup feature
Description
Motivation and Context
@guschmue @fs-eire @qjia7 @jchen10 @Jiawei-Shao Hi all, I want to discuss with you whether we could optimize the Gemm, MatMul and Conv operators for specific vendors and specific architectures, as this PR does. Reasons:
- It is difficult to design an algorithm that benefits all vendors and all architectures.
- Even within the same vendor, different architectures make this difficult.
- Maintaining and reviewing the code is also difficult if vendor and architecture information is added to common files.
- There are not enough devices to test correctness and performance. ...
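The per-vendor dispatch described above could be sketched as follows. This is a minimal illustration under stated assumptions: the names (`GpuInfo`, `SelectMatMulVariant`, the variant enum) are hypothetical and not the actual onnxruntime API; the point is only that the optimized path is opted into on validated vendor/feature combinations and everything else falls back to the generic shader.

```cpp
#include <string>

// Hypothetical sketch of vendor-specific kernel selection.
// GpuInfo and MatMulVariant are illustrative names, not onnxruntime types.
enum class MatMulVariant { kGeneric, kIntelSubgroup };

struct GpuInfo {
  std::string vendor;         // e.g. "intel", "nvidia", "amd"
  bool has_subgroup = false;  // device exposes the subgroup feature
};

MatMulVariant SelectMatMulVariant(const GpuInfo& gpu) {
  // Only opt in on devices where the optimized shader has been validated;
  // every other device keeps the existing generic path.
  if (gpu.vendor == "intel" && gpu.has_subgroup) {
    return MatMulVariant::kIntelSubgroup;
  }
  return MatMulVariant::kGeneric;
}
```

Keeping the selection in one small function like this also limits how much vendor knowledge leaks into the common operator files, which is the maintenance concern raised above.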
@xhcao It seems you didn't include your test cases in this PR. What's your concern? BTW, you'd better provide some performance data for reference!
Firstly, I want to get agreement on creating a vendor directory and making some optimizations for specific platforms, especially from the Microsoft reviewers.
If so, I will add the test cases in another PR and provide performance data here.
You don't need to depend on that. You can simply strengthen your PR with more test cases and better perf data, and make it easier to review.
Could you please merge the latest main and push?
The pipeline failures should be unrelated, and may be fixed by rerun.
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).
I understand the desire for a vendor directory. But I'm concerned that we'll have trouble with testing and CI if we have too many vendor-specific paths: it will get fragmented and very hard to test all code path variations. We already had this problem with subgroup.matrixmultiply, where I could not test a change for the Intel-specific code path, so I had to let it fall back to a different path.
But we can try it and see how that works; we just need to be careful with the testing.
@guschmue Thank you for your understanding and support. We will fully test our changes and closely monitor the status on LNL/BMG. Once it is mature enough, we can roll it out to more devices.
I haven't closely looked at all the code yet, but a few high-level comments:
- Please use workgroup_idx instead of workgroup_id.x/y/z to recalculate the right offset.
- Please use the helper functions .setByOffset/getByOffset/setByIndices/getByIndices to load/store data.
- Does your shader support any subgroup size? I don't see any sg_size-related checking in your shader. I believe Intel's subgroup sizes range from 8 to 32.
- Please use a template to write the shader if possible.
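On the first bullet: using a flat workgroup_idx means the shader has to recover its 2D tile coordinates by division and modulo over the dispatch width. A minimal sketch of that index arithmetic (the function name and parameters are hypothetical, used only to illustrate the mapping, not the actual shader code):

```cpp
#include <cstdint>
#include <utility>

// Sketch of the index math behind replacing workgroup_id.x/y with a flat
// workgroup_idx: with num_x workgroups along x in row-major dispatch order,
// the flat index maps back to 2D tile coordinates by div/mod.
std::pair<uint32_t, uint32_t> TileCoordsFromFlatIndex(uint32_t workgroup_idx,
                                                      uint32_t num_x) {
  const uint32_t tile_y = workgroup_idx / num_x;  // row of the output tile
  const uint32_t tile_x = workgroup_idx % num_x;  // column of the output tile
  return {tile_x, tile_y};
}
```

The same div/mod pattern extends to three dimensions if the dispatch also uses z; the point of the review comment is that all offsets should be derived from the one flat index so the dispatch shape can change without touching the shader body.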
@fs-eire I have addressed your comments. @qjia7 Except for the fourth comment (I will continue to optimize the PR and address it later), all of your comments have been addressed. PTAL, thank you.
@qjia7 @fs-eire PTAL, thanks.
@qjia7 I have addressed your comments. @fs-eire @guschmue Please take a look, thanks.
@qjia7 I tested the PR with ~20 models on Rocket Lake; several models showed a performance improvement, though smaller than on Lunar Lake. There was no regression on any model, so I applied the PR to all Intel devices that support the subgroup feature. Please take a look again.
The shader changes look good to me. I’ll hand off the remaining structural review to @fs-eire and @guschmue. Thanks!
@fs-eire @guschmue PTAL