webgpu: optimize Gemm and MatMul using subgroup feature
Description
Motivation and Context
@guschmue @fs-eire @qjia7 @jchen10 @Jiawei-Shao Hi all, I want to discuss with you whether we could optimize the Gemm, MatMul and Conv operators for specific vendors and specific architectures, as this PR does. Reasons:
- It is difficult to design an algorithm that benefits all vendors and all architectures.
- Even within the same vendor, different architectures make this difficult.
- Maintaining and reviewing the code is also difficult if vendor and architecture information is added to common files.
- There are not enough devices to test correctness and performance. ...
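The per-vendor dispatch described above could be sketched as follows. This is a minimal illustration under stated assumptions: the names (`GpuInfo`, `SelectMatMulVariant`, the variant enum) are hypothetical and not the actual onnxruntime API; the point is only that the optimized path is opted into on validated vendor/feature combinations and everything else falls back to the generic shader.

```cpp
#include <string>

// Hypothetical sketch of vendor-specific kernel selection.
// GpuInfo and MatMulVariant are illustrative names, not onnxruntime types.
enum class MatMulVariant { kGeneric, kIntelSubgroup };

struct GpuInfo {
  std::string vendor;         // e.g. "intel", "nvidia", "amd"
  bool has_subgroup = false;  // device exposes the subgroup feature
};

MatMulVariant SelectMatMulVariant(const GpuInfo& gpu) {
  // Only opt in on devices where the optimized shader has been validated;
  // every other device keeps the existing generic path.
  if (gpu.vendor == "intel" && gpu.has_subgroup) {
    return MatMulVariant::kIntelSubgroup;
  }
  return MatMulVariant::kGeneric;
}
```

Keeping the selection in one small function like this also limits how much vendor knowledge leaks into the common operator files, which is the maintenance concern raised above.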
@xhcao It seems you didn't include your test cases in this PR. What's your concern? BTW, you'd better provide some performance data for reference!
Firstly, I want to get agreement on creating a vendor directory and making some optimizations for specific platforms, especially from the Microsoft reviewers.
If so, I will add the test cases in another PR and provide performance data here.
You don't need to depend on that. You can simply strengthen your PR with more test cases and better perf data, and make it easier to review.
Could you please merge the latest main and push?
The pipeline failures should be unrelated, and may be fixed by rerun.
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).
I understand the desire for a vendor directory. But I'm concerned that we'll have trouble with testing and CI if we have too many vendor-specific paths: it will get fragmented and very hard to test all code path variations. We already had this problem with subgroup.matrixmultiply, where I could not test a change for the Intel-specific code path, so I had to let it fall back to a different path.
But we can try it and see how that works; we just need to be careful with the testing.
@guschmue Thank you for your understanding and support. We will fully test our changes and closely monitor the status on LNL/BMG. Once it is mature enough, we can roll it out to more devices.
I haven't closely looked at all the code yet, but a few high-level comments:
- Please use workgroup_idx instead of workgroup_id.x/y/z to recalculate the right offset.
- Please use the helper functions .setByOffset/getByOffset/setByIndices/getByIndices to load/store data.
- Does your shader support any subgroup size? I don't see any sg_size-related checking in your shader. I believe Intel's subgroup sizes range from 8 to 32.
- Please use a template to write the shader if possible.
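On the first bullet: using a flat workgroup_idx means the shader has to recover its 2D tile coordinates by division and modulo over the dispatch width. A minimal sketch of that index arithmetic (the function name and parameters are hypothetical, used only to illustrate the mapping, not the actual shader code):

```cpp
#include <cstdint>
#include <utility>

// Sketch of the index math behind replacing workgroup_id.x/y with a flat
// workgroup_idx: with num_x workgroups along x in row-major dispatch order,
// the flat index maps back to 2D tile coordinates by div/mod.
std::pair<uint32_t, uint32_t> TileCoordsFromFlatIndex(uint32_t workgroup_idx,
                                                      uint32_t num_x) {
  const uint32_t tile_y = workgroup_idx / num_x;  // row of the output tile
  const uint32_t tile_x = workgroup_idx % num_x;  // column of the output tile
  return {tile_x, tile_y};
}
```

The same div/mod pattern extends to three dimensions if the dispatch also uses z; the point of the review comment is that all offsets should be derived from the one flat index so the dispatch shape can change without touching the shader body.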
@fs-eire I have addressed your comments. @qjia7 Except for the fourth comment (I will continue to optimize the PR and address it later), all of your comments have been addressed. PTAL, thank you.
@qjia7 @fs-eire PTAL, thanks.
@qjia7 I have addressed your comments. @fs-eire @guschmue Please take a look, thanks.
@qjia7 I tested the PR with ~20 models on Rocket Lake; several models showed a performance improvement, though smaller than on Lunar Lake. There was no regression on any model, so I applied the PR to all Intel devices that support the subgroup feature. Please take a look again.
The shader changes look good to me. I’ll hand off the remaining structural review to @fs-eire and @guschmue. Thanks!
@fs-eire @guschmue PTAL