BF16 × FP4 GEMM with bias and SwiGLU
Proposed changes
Added an example of a BF16 × FP4 GEMM with bias and SwiGLU activation. In this implementation, both the FP4 weights and the FP4 scaling factors are stored in uint8 format. During computation, matrix B (FP4) is dequantized to bf16 before the matrix multiplication. Additionally, the bias addition and the SwiGLU activation are fused into a single kernel to reduce memory traffic. A minimal reference sketch of the overall computation is given below.
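The sketch below is a host-side reference in plain C++, not the kernel added by this PR. It assumes two FP4 values packed per uint8 byte along K, a per-column dequant scale, and a [gate | up] column layout for the SwiGLU halves; the lookup-table values, the packing order, and the helper names (`fp4_lut`, `reference_fp4_gemm_bias_swiglu`) are illustrative assumptions, not the exact format used by the example.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical FP4 (e2m1-style) value table; the real mapping depends on the
// FP4 encoding the example uses.
static const float fp4_lut[16] = {
     0.0f,  0.5f,  1.0f,  1.5f,  2.0f,  3.0f,  4.0f,  6.0f,
    -0.0f, -0.5f, -1.0f, -1.5f, -2.0f, -3.0f, -4.0f, -6.0f};

static float silu(float x) { return x / (1.0f + std::exp(-x)); }

// A:      [M x K]   activations (bf16 in the real example, float here)
// B_fp4:  [K/2 x 2N] bytes, two 4-bit weights packed per byte along K
// scaleB: [2N]      per-column dequant scale (stored as uint8 in the example,
//                   treated as float here for clarity)
// bias:   [2N]      bias added before the activation
// out:    [M x N]   caller-allocated; receives silu(gate) * up
void reference_fp4_gemm_bias_swiglu(const std::vector<float>& A,
                                    const std::vector<uint8_t>& B_fp4,
                                    const std::vector<float>& scaleB,
                                    const std::vector<float>& bias,
                                    std::vector<float>& out,
                                    int M, int N, int K) {
    std::vector<float> C(static_cast<size_t>(M) * 2 * N);
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < 2 * N; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                uint8_t packed = B_fp4[(k / 2) * 2 * N + n];
                uint8_t nibble = (k % 2 == 0) ? (packed & 0x0F) : (packed >> 4);
                float b = fp4_lut[nibble] * scaleB[n];  // dequantize B element
                acc += A[m * K + k] * b;
            }
            C[m * 2 * N + n] = acc;
        }
    }
    // Fused epilogue: bias add + SwiGLU over the two halves of C.
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            float gate = C[m * 2 * N + n] + bias[n];
            float up   = C[m * 2 * N + N + n] + bias[N + n];
            out[m * N + n] = silu(gate) * up;
        }
    }
}
```

In the fused kernel, the epilogue runs on the GEMM tile while it is still in registers/shared memory, which is what avoids an extra round trip of the [M x 2N] intermediate through global memory.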
Checklist
Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.
- [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally
- [ ] I have added the test to the REGRESSION_TESTS list defined at the top of tests/CMakeLists.txt, if the test takes more than 30 seconds to run
- [ ] I have added inline documentation which enables the maintainers to understand the motivation
- [ ] I have removed the stale documentation which is no longer relevant after this pull request
- [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
- [ ] I have run clang-format on all changed files
- [ ] Any dependent changes have been merged
Discussion
If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered
Can you please resolve the merge conflict?
Also, sync to latest develop and make sure the CI tests are passing.
> Can you please resolve the merge conflict?

Sure, I will check.
@eliotwang please resolve conflicts. Marking as stale, as there has been no progress since Oct 30.