How can we integrate the DeepGEMM FP8 GEMM implementation into TE's block-wise scaling?
Hi, how can we plug other CuTe/CUTLASS operators into TE? For example, the GEMM from DeepGEMM, a library designed for FP8 GEMMs with the fine-grained scaling proposed in DeepSeek-V3.
We will not add DeepGEMM to TE because it lacks the GEMM for wgrad (1x128 by 1x128) in backpropagation, and its JIT mechanism brings non-negligible overheads in training.
We're landing a DeepSeek-V3-like FP8 recipe (1x128 scaling for activations and 128x128 for weights) in TE. It will use the block-wise GEMM from cuBLAS (to be released in CUDA 12.9), which has performance comparable to DeepGEMM, supports both 1D2D (1x128 by 128x128) and 1D1D (1x128 by 1x128) layouts, and gets rid of the JIT overheads.
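To make the granularity concrete, here is a rough NumPy sketch of the scaling layout this recipe implies: one scale per 1x128 segment of each activation row and one scale per 128x128 tile of the weight matrix. This is not TE code; the tensor sizes and names are made up purely for illustration.

import numpy as np

FP8_E4M3_MAX = 448.0  # largest representable magnitude in FP8 E4M3

# Hypothetical tensor sizes, chosen only to illustrate the granularity.
act = np.random.randn(4096, 8192).astype(np.float32)   # activations: 1x128 blocks
wgt = np.random.randn(8192, 8192).astype(np.float32)   # weights: 128x128 blocks

# One scale per contiguous 1x128 segment of each activation row -> shape (4096, 64).
act_amax = np.abs(act.reshape(4096, 8192 // 128, 128)).max(axis=-1)
act_scale = FP8_E4M3_MAX / act_amax

# One scale per 128x128 tile of the weight matrix -> shape (64, 64).
wgt_amax = np.abs(wgt.reshape(8192 // 128, 128, 8192 // 128, 128)).max(axis=(1, 3))
wgt_scale = FP8_E4M3_MAX / wgt_amax

print(act_scale.shape, wgt_scale.shape)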
Hi Xin,
Could we expect an ETA on this?
See PR https://github.com/NVIDIA/TransformerEngine/pull/1559. Note you will need CUDA 12.9 to run it because the groupwise/blockwise FP8 GEMM is shipped with CUDA 12.9 (ETA early April).
Could you share a preview release of CUDA 12.9? We just want to try block-wise FP8 GEMM functionality.
Hi,
Checking this PR (#1559), it seems that it only supports the Linear module, but not grouped linear yet. Are there any plans to land the fine-grained scaling for grouped linear too?
Yes, it's included in https://github.com/NVIDIA/TransformerEngine/pull/1525.
Hi @yaox12, is there by any chance an updated ETA on cuda 12.9?
The new ETA is some time next week. But you know, it's not under my control 😢
Thanks!
CUDA 12.9 was released on May 1st. https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cublas-release-12-9
We have added support for new scaling modes on Hopper (sm_90), including outer vector (per-channel/per-row), per-128-element, and per-128x128-block. Note that there is currently limited support for fused epilogues.
Now you can try FP8 training with the blockwise recipe:
with fp8_autocast(fp8_recipe=recipe.Float8BlockScaling()):
    model()
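For a more self-contained sketch, something along these lines should work with CUDA 12.9 and a TE build that includes the blockwise recipe. The module choice and tensor shapes are only illustrative.

import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe
from transformer_engine.pytorch import fp8_autocast

# Any TE module can be used; a single Linear layer keeps the example small.
model = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda()
inp = torch.randn(1024, 4096, device="cuda", dtype=torch.bfloat16)

fp8_recipe = recipe.Float8BlockScaling()  # 1x128 activations, 128x128 weights
with fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(inp)
out.float().sum().backward()  # backward pass also uses the blockwise FP8 GEMMs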
Wow~ So great!!!
@yaox12 So this feature still cannot be used on L40s machines, and is only supported on Hopper (sm_90) or later?