webgpu: optimize Gemm and MatMul using subgroup feature

xhcao opened this issue 2 months ago • 13 comments


xhcao · Oct 29 '25 08:10

@guschmue @fs-eire @qjia7 @jchen10 @Jiawei-Shao Hi all, I want to discuss with you whether we could optimize the Gemm, MatMul and Conv operators for specific vendors and specific architectures, as this PR does (a rough sketch of the idea follows the list below). Reasons:

  1. It is difficult to design an algorithm that benefits all vendors and all architectures.
  2. Even for the same vendor, different architectures make this difficult.
  3. Maintaining and reviewing the code is also difficult if vendor and architecture information is added to common files.
  4. There are not enough devices to test the correctness and performance. ...
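
For illustration only, here is a minimal C++ sketch of the dispatch such a vendor directory could enable. Every name in it (GpuVendor, DeviceInfo, TryRunVendorGemm, RunIntelSubgroupGemm, the webgpu/intel/ path) is a hypothetical stand-in, not an actual onnxruntime API:

```cpp
// Hypothetical sketch of vendor-specific dispatch; the names below do not
// exist in onnxruntime and only illustrate the proposed structure.
enum class GpuVendor { kIntel, kNvidia, kAmd, kOther };

struct DeviceInfo {
  GpuVendor vendor;
  bool has_subgroup_feature;  // device reports the WebGPU subgroups feature
};

// The shared Gemm entry point stays in the common file; the tuned kernel
// would live under a per-vendor directory (e.g. webgpu/intel/).
bool TryRunVendorGemm(const DeviceInfo& device /*, tensors... */) {
  if (device.vendor == GpuVendor::kIntel && device.has_subgroup_feature) {
    // return RunIntelSubgroupGemm(...);  // hypothetical, in webgpu/intel/
    return true;
  }
  return false;  // caller falls back to the generic Gemm shader
}
```

The common code would then need only one early-out call per operator, and everything architecture-specific stays out of the shared files.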

xhcao · Oct 29 '25 09:10

@xhcao It seems you didn't include your test cases in this PR. What's your concern? BTW, you'd better provide some performance data for reference!

jchen10 · Oct 29 '25 11:10

> @xhcao It seems you didn't include your test cases in this PR. What's your concern? BTW, you'd better provide some performance data for reference!

First, I want consent to create a vendor directory and make some optimizations for specific platforms, especially consent from the Microsoft reviewers. If so, I will add the test cases in another PR and provide performance data here.

xhcao · Oct 31 '25 02:10

> @xhcao It seems you didn't include your test cases in this PR. What's your concern? BTW, you'd better provide some performance data for reference!
>
> First, I want consent to create a vendor directory and make some optimizations for specific platforms, especially consent from the Microsoft reviewers. If so, I will add the test cases in another PR and provide performance data here.

You don't need to depend on that. You can simply improve this PR with more test cases and better perf data, and make it easier to review.

jchen10 · Oct 31 '25 06:10

Could you please merge the latest main into this branch and push?

The pipeline failures should be unrelated and may be fixed by a rerun.

fs-eire · Nov 04 '25 01:11

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

guschmue · Nov 04 '25 19:11

Azure Pipelines successfully started running 4 pipeline(s).

azure-pipelines[bot] · Nov 04 '25 19:11

I understand the desire for a vendor directory. But I'm concerned that we'll have trouble with testing and CI if we have too many vendor-specific paths: it will get fragmented and very hard to test all code-path variations. We already had this problem with subgroup.matrixmultiply, where I could not test a change for the Intel-specific code path and had to let it fall back to a different path.

But we can try it and see how that works; we just need to be careful with the testing.
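
One way to keep both paths honest is sketched below, assuming a hook exists to force the generic path; RunGemmOnce and its force_generic parameter are hypothetical names, not an actual onnxruntime option:

```cpp
// Hypothetical test sketch: run the same Gemm through the generic and the
// vendor-specific path and compare. RunGemmOnce is an invented hook that
// would be implemented against the real kernels.
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<float> RunGemmOnce(const std::vector<float>& a,
                               const std::vector<float>& b,
                               bool force_generic);  // hypothetical hook

void CheckVendorPathMatchesGeneric(const std::vector<float>& a,
                                   const std::vector<float>& b) {
  std::vector<float> expected = RunGemmOnce(a, b, /*force_generic=*/true);
  std::vector<float> actual = RunGemmOnce(a, b, /*force_generic=*/false);
  assert(expected.size() == actual.size());
  for (std::size_t i = 0; i < expected.size(); ++i) {
    // Loose tolerance: subgroup reductions may reorder float additions.
    float tol = 1e-3f + 1e-3f * std::fabs(expected[i]);
    assert(std::fabs(expected[i] - actual[i]) <= tol);
  }
}
```

Running such a comparison over the operator tests on at least one device per enabled architecture would catch the kind of untestable divergence described above.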

guschmue · Nov 10 '25 16:11

@guschmue Thank you for your understanding and support. We will fully test our changes and closely monitor the status on LNL/BMG. Once it is mature enough, we can roll it out to more devices.

jchen10 · Nov 11 '25 01:11

I haven't looked closely at all the code, but here are several high-level comments (a rough sketch illustrating points 1 and 3 follows the list):

  1. Please use workgroup_idx instead of workgroup_id.x/y/z to compute the right offset.
  2. Please use the helper functions setByOffset/getByOffset/setByIndices/getByIndices to load/store data.
  3. Does your shader support any subgroup size? I don't see any sg_size-related checks in your shader. I believe Intel's subgroup sizes range from 8 to 32.
  4. Please use a template to write the shader if possible.
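
For points 1 and 3, something like the following WGSL fragment (shown as the raw string a C++ shader generator would emit) is the shape of guard being asked for; sg_size and workgroup_idx are the names used in these shaders, while uniforms.num_tiles_per_row and the 16-wide branch are made-up placeholders:

```cpp
// Illustrative WGSL fragment only; the tile uniform is a placeholder and
// the sg_size branch shows the kind of guard requested in comment 3.
constexpr const char* kSubgroupGemmSnippet = R"WGSL(
  // Intel subgroup sizes are typically 8..32 and only known at runtime,
  // so guard the tuned path on sg_size instead of assuming one width.
  if (sg_size == 16u) {
    // fast path tuned for a 16-wide subgroup
  } else {
    // fallback that stays correct for any subgroup size
  }

  // Use the flat workgroup_idx (comment 1) rather than workgroup_id.x/y/z,
  // so offsets stay correct after dispatch-size normalization.
  let tile_row = workgroup_idx / uniforms.num_tiles_per_row;
  let tile_col = workgroup_idx % uniforms.num_tiles_per_row;
)WGSL";
```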

qjia7 · Nov 18 '25 02:11

@fs-eire I have addressed your comments. @qjia7 Except for the fourth comment (I will keep optimizing the PR and address it in the future), all the comments have been addressed. PTAL, thank you.

xhcao · Dec 01 '25 02:12

@qjia7 @fs-eire PTAL, thanks.

xhcao · Dec 05 '25 05:12

@qjia7 I have addressed your comments. @fs-eire @guschmue Please take a look, thanks.

xhcao · Dec 12 '25 05:12

@qjia7 I tested the PR with ~20 models on Rocket Lake; several models showed a performance improvement, though smaller than on Lunar Lake. There was no regression on any model, so I applied the PR to all Intel devices that support the subgroup feature. Please take a look again.

xhcao · Dec 16 '25 09:12

The shader changes look good to me. I’ll hand off the remaining structural review to @fs-eire and @guschmue. Thanks!

qjia7 · Dec 17 '25 06:12

@fs-eire @guschmue PTAL

xhcao · Dec 23 '25 08:12