How can we integrate the DeepGEMM FP8 GEMM implementation into TE's block-wise scaling?
Hi, how can we plug other CuTe/CUTLASS operators into TE? For example, the GEMM from DeepGEMM, a library designed for FP8 GEMMs with the fine-grained scaling proposed in DeepSeek-V3.
We will not add DeepGEMM to TE because it lacks the GEMM for wgrad (1x128 by 1x128) in backpropagation, and its JIT mechanism brings non-negligible overheads in training.
We're landing a DeepSeek-V3-like FP8 recipe (1x128 scaling for activations and 128x128 for weights) in TE. It will use the block-wise GEMM from cuBLAS (to be released in CUDA 12.9), which has performance comparable to DeepGEMM, supports both 1D2D (1x128 by 128x128) and 1D1D (1x128 by 1x128) layouts, and gets rid of the JIT overheads.
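To make the granularity concrete, here is a rough NumPy sketch of the scaling layout this recipe implies: one scale per 1x128 segment of each activation row and one scale per 128x128 tile of the weight matrix. This is not TE code; the tensor sizes and names are made up purely for illustration.

import numpy as np

FP8_E4M3_MAX = 448.0  # largest representable magnitude in FP8 E4M3

# Hypothetical tensor sizes, chosen only to illustrate the granularity.
act = np.random.randn(4096, 8192).astype(np.float32)   # activations: 1x128 blocks
wgt = np.random.randn(8192, 8192).astype(np.float32)   # weights: 128x128 blocks

# One scale per contiguous 1x128 segment of each activation row -> shape (4096, 64).
act_amax = np.abs(act.reshape(4096, 8192 // 128, 128)).max(axis=-1)
act_scale = FP8_E4M3_MAX / act_amax

# One scale per 128x128 tile of the weight matrix -> shape (64, 64).
wgt_amax = np.abs(wgt.reshape(8192 // 128, 128, 8192 // 128, 128)).max(axis=(1, 3))
wgt_scale = FP8_E4M3_MAX / wgt_amax

print(act_scale.shape, wgt_scale.shape)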
Hi Xin,
Could we expect an ETA on this?
See PR https://github.com/NVIDIA/TransformerEngine/pull/1559. Note you will need CUDA 12.9 to run it because the groupwise/blockwise FP8 GEMM is shipped with CUDA 12.9 (ETA early April).
Could you share a preview release of CUDA 12.9? We just want to try block-wise FP8 GEMM functionality.
Hi,
Checking this PR (#1559), it seems that it only supports the Linear module, but not grouped linear yet. Are there any plans to land the fine-grained scaling for grouped linear too?
Yes, it's included in https://github.com/NVIDIA/TransformerEngine/pull/1525.
Hi @yaox12, is there by any chance an updated ETA on cuda 12.9?
The new ETA is some time next week. But you know, it's not under my control 😢
Thanks!
CUDA 12.9 was released on May 1st. https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cublas-release-12-9
We have added support for new scaling modes on Hopper (sm_90), including outer vector (per-channel/per-row), per-128-element, and per-128x128-block. Note that there is currently limited support for fused epilogues.
Now you can try FP8 training with the blockwise recipe:
with fp8_autocast(fp8_recipe=recipe.Float8BlockScaling()):
    model()
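For a more self-contained sketch, something along these lines should work with CUDA 12.9 and a TE build that includes the blockwise recipe. The module choice and tensor shapes are only illustrative.

import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe
from transformer_engine.pytorch import fp8_autocast

# Any TE module can be used; a single Linear layer keeps the example small.
model = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda()
inp = torch.randn(1024, 4096, device="cuda", dtype=torch.bfloat16)

fp8_recipe = recipe.Float8BlockScaling()  # 1x128 activations, 128x128 weights
with fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(inp)
out.float().sum().backward()  # backward pass also uses the blockwise FP8 GEMMs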
Wow~ So great!!!
@yaox12 So this feature still cannot be used on L40s machines, and is only supported on Hopper (sm_90) or later?