TensorRT-LLM Feat: Support Linear block scale layout in FP4 quantization

Support the linear (row-major) block scale factor layout in the FP4 quantize kernel. This layout is used by the trtllm-gen MoE FP4 kernel.
New unit tests cover the linear-layout FP4 quantize kernel. Note that the FP4 linear-layout GEMM kernel is not supported yet; FP4 GEMM support should be added once that kernel is ready.
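
For readers unfamiliar with the term, here is a minimal Python sketch of what a linear (row-major) block scale layout means. It is not the kernel from this PR. It assumes 16 elements per scale factor and a scale of amax / 6.0 (6.0 being the largest FP4 E2M1 magnitude), keeps scales as plain floats, and uses hypothetical names (`linear_sf_offset`, `block_scales_linear`).

```python
from typing import List, Sequence

FP4_E2M1_MAX = 6.0  # largest magnitude representable in FP4 (E2M1)
BLOCK_SIZE = 16     # ASSUMPTION: elements sharing one scale factor


def linear_sf_offset(row: int, col: int, num_cols: int,
                     block_size: int = BLOCK_SIZE) -> int:
    """Row-major index of the scale factor covering element (row, col)."""
    blocks_per_row = (num_cols + block_size - 1) // block_size
    return row * blocks_per_row + col // block_size


def block_scales_linear(matrix: Sequence[Sequence[float]],
                        block_size: int = BLOCK_SIZE) -> List[float]:
    """Compute one scale per block and store them flat in row-major order."""
    num_rows = len(matrix)
    num_cols = len(matrix[0]) if num_rows else 0
    blocks_per_row = (num_cols + block_size - 1) // block_size
    scales = [0.0] * (num_rows * blocks_per_row)
    for r, row in enumerate(matrix):
        for start in range(0, num_cols, block_size):
            # Largest magnitude in this block decides its scale.
            amax = max(abs(v) for v in row[start:start + block_size])
            idx = linear_sf_offset(r, start, num_cols, block_size)
            scales[idx] = amax / FP4_E2M1_MAX
    return scales


if __name__ == "__main__":
    demo = [[1.0] * 16 + [6.0] * 16,
            [3.0] * 16 + [12.0] * 16]
    print(block_scales_linear(demo))
```

In this layout the scale factors of one row sit contiguously in memory, so a consumer can index them with a plain `row * blocks_per_row + block` offset and no tile-specific reordering.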

Mar 24 '25 22:03 yibinl-nvidia

Need to update internal_cutlass_kernel libs.

Mar 25 '25 04:03 yibinl-nvidia

> Need to update internal_cutlass_kernel libs.

@yibinl-nvidia is there an MR for updating internal_cutlass_kernels?

Mar 25 '25 09:03 nv-guomingz

> Need to update internal_cutlass_kernel libs. @yibinl-nvidia is there an MR for updating internal_cutlass_kernels?

Yes, I will post an MR soon. I am still familiarizing myself with the internal kernel change workflow, and I need to check that the trtllm tests pass with the updated lib files.

Mar 25 '25 16:03 yibinl-nvidia

/bot run

Mar 26 '25 21:03 yibinl-nvidia

@mikeiovine could you re-approve this PR? This is a mirror of the internal MR, with minor changes to the internal_cutlass_kernel lib files. Thanks!

Mar 26 '25 21:03 yibinl-nvidia

/bot kill

Mar 26 '25 21:03 yibinl-nvidia

PR_Github #615 [ kill ] triggered by Bot

Mar 26 '25 21:03 tensorrt-cicd

PR_Github #615 [ kill ] completed with state SUCCESS Successfully killed previous jobs for commit aa306bf

Mar 26 '25 21:03 tensorrt-cicd

/bot run

Mar 26 '25 22:03 yibinl-nvidia

PR_Github #618 [ run ] triggered by Bot

Mar 26 '25 22:03 tensorrt-cicd

PR_Github #618 [ run ] completed with state FAILURE /LLM/main/L0_MergeRequest_PR pipeline #520 completed with status: 'FAILURE'

Mar 26 '25 23:03 tensorrt-cicd

/bot run

Mar 31 '25 06:03 yibinl-nvidia

Sorry for the delay! I missed this in the move to GitHub. Looks good to me, assuming there are only trivial changes compared to what I reviewed internally.

Mar 31 '25 13:03 mikeiovine

> Sorry for the delay! I missed this in the move to GitHub. Looks good to me, assuming there are only trivial changes compared to what I reviewed internally.

Yes, this is a mirror of the change to the internal repo. The only difference is in the internal cutlass kernel directory, where the changes are bundled into lib files.

Mar 31 '25 17:03 yibinl-nvidia

/bot run

Mar 31 '25 23:03 yibinl-nvidia

PR_Github #806 [ run ] triggered by Bot

Mar 31 '25 23:03 tensorrt-cicd

PR_Github #806 [ run ] completed with state FAILURE /LLM/main/L0_MergeRequest_PR pipeline #652 completed with status: 'FAILURE'

Apr 01 '25 00:04 tensorrt-cicd

/bot run

Apr 01 '25 22:04 yibinl-nvidia

PR_Github #923 [ run ] triggered by Bot

Apr 01 '25 22:04 tensorrt-cicd

PR_Github #923 [ run ] completed with state FAILURE /LLM/main/L0_MergeRequest_PR pipeline #729 completed with status: 'FAILURE'

Apr 01 '25 22:04 tensorrt-cicd

/bot run

Apr 01 '25 23:04 yibinl-nvidia

PR_Github #935 [ run ] triggered by Bot

Apr 01 '25 23:04 tensorrt-cicd

PR_Github #935 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #737 completed with status: 'SUCCESS'

Apr 02 '25 02:04 tensorrt-cicd

Need to wait for https://github.com/NVIDIA/TensorRT-LLM/pull/3071 to merge first.

Apr 02 '25 02:04 yibinl-nvidia

@yibinl-nvidia #3071 has been merged; please resolve the conflicts in this PR.

Apr 02 '25 04:04 nv-guomingz

/bot run

Apr 02 '25 05:04 yibinl-nvidia

PR_Github #971 [ run ] triggered by Bot

Apr 02 '25 05:04 tensorrt-cicd

PR_Github #971 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #756 completed with status: 'FAILURE'

Apr 02 '25 14:04 tensorrt-cicd

/bot run

Apr 02 '25 16:04 yibinl-nvidia

PR_Github #1034 [ run ] triggered by Bot

Apr 02 '25 16:04 tensorrt-cicd
