
FP4 Training

Open cassanof opened this issue 7 months ago • 16 comments

Hi!

Wondering if there is any plan to support FP4 training in transformer engine. Would be great if so!

Thanks!

cassanof avatar Apr 19 '25 00:04 cassanof

It depends what you mean by FP4 training. We do not yet plan to support full FP4 training (as in, both forward and backward) since there is no evidence yet that it would converge for large models/long token horizons. We do plan to support FP4 forward/mxFP8 backward though, with the main use case being fine-tuning for FP4 inference. This support is our next goal after finishing the DeepSeek fp8 recipe.

ptrendx avatar Apr 19 '25 15:04 ptrendx
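
To make the recipe described above concrete for readers following along, here is a minimal fake-quantization sketch of an "FP4 forward / MXFP8 backward" linear layer in plain PyTorch. This is not TransformerEngine's implementation (which relies on dedicated Blackwell kernels); the helper names, the block sizes (16 for the FP4 path, 32 for the FP8-like path), and the simplified FP8 handling are assumptions made purely for illustration.

```python
# Minimal fake-quantization sketch of an FP4-forward / MXFP8-backward linear layer.
# All math still happens in FP32 -- we only *simulate* the low-precision rounding.
# Helper names and block sizes are illustrative assumptions, not TE's implementation.
import torch

FP4_E2M1_POS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # positive E2M1 values

def _block_scale(x, grid_max, block):
    """Reshape into 1-D blocks and compute a per-block scale mapping amax -> grid_max."""
    xb = x.reshape(-1, block)
    scale = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / grid_max
    return xb, scale

def fake_fp4_quantize(x, block=16):
    """Round-to-nearest onto the E2M1 grid with a per-block scale (NVFP4-style block size)."""
    xb, scale = _block_scale(x, grid_max=6.0, block=block)
    y = xb / scale
    grid = FP4_E2M1_POS.to(y.dtype)
    idx = (y.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)   # nearest grid point
    return (grid[idx] * y.sign() * scale).reshape(x.shape)

def fake_mxfp8_quantize(x, block=32):
    """Crude FP8-E4M3 stand-in: per-block scaling plus range clamping only.
    A real MXFP8 cast would also round the mantissa and use a power-of-two scale."""
    xb, scale = _block_scale(x, grid_max=448.0, block=block)
    return (torch.clamp(xb / scale, -448.0, 448.0) * scale).reshape(x.shape)

class FP4FwdMXFP8BwdLinear(torch.autograd.Function):
    @staticmethod
    def forward(ctx, inp, weight):
        ctx.save_for_backward(inp, weight)
        # Forward GEMM with both operands simulated in FP4.
        return fake_fp4_quantize(inp) @ fake_fp4_quantize(weight).t()

    @staticmethod
    def backward(ctx, grad_out):
        inp, weight = ctx.saved_tensors
        g = fake_mxfp8_quantize(grad_out)
        # Backward GEMMs (dgrad and wgrad) with operands simulated in MXFP8.
        grad_inp = g @ fake_mxfp8_quantize(weight)
        grad_weight = g.t() @ fake_mxfp8_quantize(inp)
        return grad_inp, grad_weight

x = torch.randn(8, 64, requires_grad=True)
w = torch.randn(128, 64, requires_grad=True)
FP4FwdMXFP8BwdLinear.apply(x, w).sum().backward()   # runs end to end, gradients land on x and w
```

In a real recipe the casts would happen inside fused GEMM kernels and the per-block scales would be carried alongside the data rather than folded back in, but the split between a low-precision forward and a higher-precision backward is the same idea.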

It depends what you mean by FP4 training. We do not yet plan to support full FP4 training (as in, both forward and backward) since there is no evidence yet that it would converge for large models/long token horizons. We do plan to support FP4 forward/mxFP8 backward though, with the main use case being fine-tuning for FP4 inference. This support is our next goal after finishing the DeepSeek fp8 recipe.

That's a good idea! I have another question: do you think there would be any advantage to FP6 training? Are there any plans for FP6 training?

lixlbuaa avatar Apr 21 '25 06:04 lixlbuaa

Are there any plans about FP6 training?

I am also curious about FP6. It seems like there is no computational benefit of FP6 over FP8 on Blackwell; is it only for memory savings (e.g. caching KVs)? I guess loading the weights would be slightly faster.

cassanof avatar Apr 21 '25 06:04 cassanof

You are correct that there is no computational benefit to FP6 compared to FP8 - for training the speedup would be close to 0. The main use case for FP6 is therefore inference, as it is memory bound and the decrease in size would directly translate to speedup. The "unfortunate" (for FP6) thing about inference though is that FP4 works great there and so it is a much preferred choice. That is why FP6 is much lower priority for us (across all libraries, not just TE) and there are currently no plans for it.

ptrendx avatar Apr 25 '25 00:04 ptrendx
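
To put rough numbers on the memory-bound argument: in autoregressive decode, every generated token streams the weights from HBM once, so an upper bound on throughput is simply bandwidth divided by weight bytes. The model size and bandwidth below are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope decode throughput when generation is limited purely by
# streaming the weights from HBM (ignores KV cache, activations, batching, overlap).
# Both numbers below are illustrative assumptions, not measured values.
params = 70e9          # assume a 70B-parameter dense model
hbm_bandwidth = 8e12   # assume ~8 TB/s of usable HBM bandwidth

for fmt, bits in [("FP8", 8), ("FP6", 6), ("FP4", 4)]:
    weight_bytes = params * bits / 8
    tokens_per_s = hbm_bandwidth / weight_bytes   # upper bound on single-stream decode
    print(f"{fmt}: {weight_bytes / 1e9:.0f} GB of weights, "
          f"~{tokens_per_s:.0f} tok/s ceiling ({8 / bits:.2f}x vs FP8)")
```

This is the point made above: FP6's advantage over FP8 is purely a memory/bandwidth one (roughly 1.33x at best), while FP4 offers about 2x on the same axis and has hardware GEMM support, which is why it crowds FP6 out.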

https://arxiv.org/abs/2505.19115

dxqb avatar Jun 22 '25 09:06 dxqb

https://arxiv.org/abs/2501.17116

hg0428 avatar Jun 22 '25 22:06 hg0428

@ptrendx FP4 is certainly feasible for the backward pass as well.

hg0428 avatar Jun 22 '25 22:06 hg0428

I've been following this one as well: https://arxiv.org/abs/2505.14669 It explicitly explored training in NVFP4 on Blackwell, with optimized kernels.

It also currently has a WIP implementation in the transformers library: https://github.com/huggingface/transformers/pull/38696

The kernels are unfortunately not public yet, however.

kooshi avatar Jul 01 '25 21:07 kooshi

and another: https://arxiv.org/abs/2502.20586

TLDR of all four courtesy of Gemini:

Analysis of Competing Approaches to 4-bit LLM Training

Executive Summary

Four recent research papers independently demonstrate the viability of training Large Language Models (LLMs) using 4-bit floating-point (FP4) precision. While all achieve similar end goals, they represent four distinct philosophies and technical approaches to solving the challenges of ultra-low-precision training.

  1. "FP4 All the Way" (Chmiel et al.) systematically tests various block sizes and scaling formats, concluding that NVFP4 is empirically superior. Its key contributions are a practical recipe (NVFP4 + split rounding) and a theoretical diagnostic (√3 threshold) to identify when FP4 training stagnates, justifying a final higher-precision fine-tuning phase.
  2. "Quartet" (Castro et al.) provides a formal scaling-law framework for comparing methods and proposes a sophisticated technique (QuEST forward, SR backward). Its most actionable contribution is a set of optimized CUDA kernels for MXFP4 on Blackwell, making its performance claims concrete and immediately useful for practitioners.
  3. "Optimizing LLM Training" (Wang et al.) tackles weight gradients and activation outliers with separate tools (DGE and OCC). Its unique hybrid computation—pairing a dense FP4 GEMM with a sparse high-precision GEMM—is a pragmatic trade-off, sacrificing computational purity to guarantee robustness against extreme outliers.
  4. "Training LLMs with MXFP4" (Tseng et al.) focuses exclusively on accelerating the backward pass with a robust SR+RHT recipe. This approach captures significant speedups while sidestepping the forward-pass quantization problem, making it a stable and practical method for achieving near-lossless acceleration.

The research converges on the necessity of Stochastic Rounding (SR) for unbiased gradients and the Hadamard Transform for outlier management. However, the papers diverge on how and where to apply these techniques, and on the fundamental choice between the superior accuracy of the NVFP4 format and the practical, well-supported MXFP4 format.
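
As a concrete illustration of the two ingredients the papers converge on, here is a small self-contained sketch of stochastic rounding onto the FP4 (E2M1) grid combined with a randomized Hadamard transform that spreads outlier energy across a block before quantization. It is not taken from any of the four papers; the block size and the way the random signs are applied are assumptions for illustration only.

```python
# Sketch of the two shared ingredients: stochastic rounding (SR) onto the FP4 E2M1 grid,
# and a randomized Hadamard transform (RHT) that spreads outlier energy over a block
# before quantization. Illustrative only -- not any paper's exact kernel or block layout.
import torch

E2M1 = torch.tensor([-6., -4., -3., -2., -1.5, -1., -.5, 0., .5, 1., 1.5, 2., 3., 4., 6.])

def stochastic_round_to_grid(x, grid):
    """Round each element to one of its two neighbouring grid points with probability
    proportional to proximity, so that E[quantized] == x for in-range values (unbiased)."""
    hi_idx = torch.bucketize(x, grid).clamp(1, len(grid) - 1)
    lo, hi = grid[hi_idx - 1], grid[hi_idx]
    p_hi = ((x - lo) / (hi - lo)).clamp(0, 1)
    return torch.where(torch.rand_like(x) < p_hi, hi, lo)

def hadamard(n):
    """Sylvester construction of an orthonormal n x n Hadamard matrix (n a power of two)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1), torch.cat([H, -H], dim=1)], dim=0)
    return H / H.shape[0] ** 0.5

block = 32
x = torch.randn(1024, block)
x[0, 0] = 50.0                                    # inject a large outlier into one block
signs = torch.where(torch.rand(block) < 0.5, torch.tensor(-1.0), torch.tensor(1.0))
H = hadamard(block) * signs                       # random sign flips -> "randomized" Hadamard
x_rot = x @ H                                     # outlier energy is now spread over the block
scale = x_rot.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / 6.0
q = stochastic_round_to_grid((x_rot / scale).clamp(-6.0, 6.0), E2M1) * scale
x_hat = q @ H.t()                                 # H is orthonormal, so H.t() undoes the rotation
print("relative reconstruction error:", ((x_hat - x).norm() / x.norm()).item())
```

For the block containing the outlier, the rotation keeps the per-block absmax (and hence the quantization step) small, which is the effect the papers exploit; with `H` replaced by the identity, the other values in that block would be quantized much more coarsely.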


Comprehensive Comparison of Methodologies

| Feature | FP4 All the Way (Chmiel et al.) | Quartet (Castro et al.) | Optimizing LLM Training (Wang et al.) | Training LLMs with MXFP4 (Tseng et al.) |
|---|---|---|---|---|
| Primary Goal | Find a practical, working recipe for FQT through empirical analysis. | Define and implement a high-performance, "optimal" FQT method with hardware-specific kernels. | Propose a stable FP4 framework by tackling specific quantization challenges with targeted tools. | Achieve near-lossless training by accelerating the backward pass only. |
| Quantization Scope | Fully quantized: all GEMM operands (weights, activations, gradients) are in FP4. | Fully quantized: all three matrix multiplications in a linear layer are in FP4. | Hybrid computation: forward pass uses a dense FP4 GEMM plus a high-precision sparse GEMM. | Backward-only: forward pass remains in BF16; backward-pass GEMMs are in MXFP4. |
| Proposed Method | A specific recipe: NVFP4 format, split rounding (RtN forward, SR backward), and a final QAF phase. | A method named Quartet: MXFP4 format, QuEST (Hadamard + RMSE) forward pass, SR backward pass. | Framework combining DGE for weights and OCC for activations. | Recipe combining SR and RHT for the backward pass. |
| Chosen FP4 Format | NVFP4. Chosen based on superior empirical results from testing multiple formats, including its finer-grained 16-value blocks and E4M3 scaling. | MXFP4. Chosen for its hardware support, for which they developed optimized kernels. Uses 32-value blocks with E8M0 scaling. | E2M1 with simple absmax scaling. Does not leverage hardware-accelerated block formats. | MXFP4. Chosen for its clear specification and hardware relevance. Uses 32-value blocks with E8M0 scaling. |
| Outlier Handling | Implicitly managed by the format (finer-grained NVFP4) and rounding. | Hadamard Transform (in QuEST) in the forward pass to reduce quantization error by spreading outlier energy. | Outlier Clamping & Compensation (OCC): explicitly clamps activation outliers and compensates for the error with a sparse matrix. | Random Hadamard Transform (RHT) in the backward pass to reduce the variance of stochastic rounding caused by outliers. |
| Actionable Contribution | A practical recipe and a theoretical diagnostic (√3 threshold) for training stability. | Optimized Blackwell CUDA kernels for MXFP4, and a formal framework for comparing methods. | A set of specific, engineered solutions (DGE, OCC) for known failure modes. | A robust recipe (SR+RHT) for backward-pass acceleration, proven stable for long training runs. |

Potential Synergies: Creating a Unified Best-of-Breed Approach

By combining the unique strengths of each paper, a more robust and performant FP4 training methodology can be constructed.

  1. The Ultimate Fully Quantized Pipeline: A state-of-the-art FQT pipeline could be created by combining the best-in-class components for each pass.

    • Format: Use NVFP4 as the data format. Its design with smaller 16-value blocks and more precise E4M3 scaling factors is architecturally and empirically suited for higher accuracy (see the block-scaling sketch after this list).
    • Forward Pass (from "Quartet"): Use the QuEST method (Hadamard transform + RMSE-based clipping). This approach is explicitly designed to minimize the mean squared error of the forward pass, which is critical for preserving model accuracy.
    • Backward Pass (from "Training LLMs with MXFP4"): Use the SR + RHT recipe. This combination is empirically and theoretically shown to produce unbiased gradient estimates with low variance, which is essential for stable convergence during long training runs.
    • Implementation (inspired by "Quartet"): This entire pipeline would need to be implemented with highly optimized, fused CUDA kernels for NVFP4 on Blackwell to be practically effective.
  2. A Universal Diagnostic and Tuning Framework: The analytical tools from two papers could be applied to all methods to create a comprehensive evaluation and tuning strategy.

    • Diagnostic (from "FP4 All the Way"): The √3 signal-to-noise threshold can be used as a universal diagnostic to monitor any FP4 training run. When the gradients' signal-to-noise ratio drops below this threshold, it signals that training is stagnating due to quantization noise.
    • Fine-Tuning (from "FP4 All the Way"): Upon hitting the √3 threshold, a final Quantization-Aware Finetuning (QAF) phase can be initiated, where the backward pass is switched to a higher precision (e.g., BF16) to close the final performance gap with a full-precision baseline.
    • Comparison (from "Quartet"): The scaling law framework (eff_N/eff_D) can be used to rigorously quantify the performance of any proposed method, providing a principled way to compare them beyond a simple visual inspection of loss curves.
  3. Pragmatic Fallbacks for Robustness: Acknowledging that a pure FP4 pipeline may not be universally stable, a robust system could incorporate fallbacks.

    • The OCC mechanism (Wang et al.) could serve as a dynamic safety net for models or data with pathological outliers that other methods cannot handle, trading off computational purity for guaranteed stability.
    • The backward-only acceleration strategy (Tseng et al.) serves as a conservative, low-risk starting point for developers who want speedups without the stability risks of full quantization.
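
To make the NVFP4-vs-MXFP4 format trade-off from item 1 above tangible, the sketch below quantizes the same tensor two ways: 16-element blocks with a fractional scale (an FP32 stand-in for the E4M3 scale) versus 32-element blocks with a power-of-two (E8M0-style) scale, both onto the E2M1 value grid, and compares the resulting mean squared error. The block sizes and scale semantics follow the descriptions in the table above; everything else (round-to-nearest, the synthetic input distribution) is an assumption for illustration.

```python
# Compare two block-scaling schemes on the same E2M1 value grid:
#   NVFP4-style : 16-element blocks, fractional scale (FP32 stand-in for the E4M3 scale)
#   MXFP4-style : 32-element blocks, power-of-two scale (E8M0-style, rounded up to avoid clipping)
# Round-to-nearest everywhere; the input distribution is an arbitrary illustrative choice.
import torch

E2M1_POS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def rtn_to_e2m1(y):
    idx = (y.abs().unsqueeze(-1) - E2M1_POS).abs().argmin(dim=-1)
    return E2M1_POS[idx] * y.sign()

def quantize(x, block, pow2_scale):
    xb = x.reshape(-1, block)
    scale = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / 6.0
    if pow2_scale:                                   # E8M0-style: scale restricted to powers of two
        scale = torch.exp2(torch.ceil(torch.log2(scale)))
    return (rtn_to_e2m1(xb / scale) * scale).reshape(x.shape)

torch.manual_seed(0)
x = torch.randn(1024, 4096) * (1 + 5 * torch.rand(1024, 1))   # rows with varying magnitude

nvfp4_like = quantize(x, block=16, pow2_scale=False)
mxfp4_like = quantize(x, block=32, pow2_scale=True)
print("NVFP4-style MSE:", (nvfp4_like - x).pow(2).mean().item())
print("MXFP4-style MSE:", (mxfp4_like - x).pow(2).mean().item())
```

On inputs like this, the finer-grained fractional scaling typically yields the lower MSE, which is the accuracy advantage the table attributes to NVFP4; MXFP4's coarser power-of-two scale buys simplicity and dynamic range instead.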

Synthesized Insights and Key Takeaways

Reading these four papers in concert reveals several crucial insights for the future of low-precision training:

  1. The NVFP4 vs. MXFP4 Trade-off is Central. This choice represents a key hardware-software co-design decision. NVFP4 is designed for higher accuracy via finer-grained blocks and more precise fractional scaling (E4M3). MXFP4 is designed for simplicity and a wider dynamic range in its power-of-two scaling factor (E8M0). The optimal choice depends on the specific model's sensitivity and the available hardware support.
  2. Unbiased Gradients are Non-Negotiable for Scale. The research provides strong evidence that for very long, large-scale pre-training, unbiased gradient estimators (like Stochastic Rounding) are essential to prevent a performance gap from emerging over time. Biased estimators may suffice for shorter runs but are not a viable path to true state-of-the-art model training.
  3. The Duality of the Hadamard Transform. This mathematical tool emerges as a "Swiss Army knife" for quantization. It is used in two distinct ways across the papers: to reduce quantization error (MSE) in the forward pass ("Quartet") and to reduce the variance of stochastic rounding in the backward pass (Tseng et al.), demonstrating its versatility in managing different undesirable effects of quantization.
  4. A Complete Solution Requires Managing the Full Error Profile. Successful FP4 training is not about solving a single problem. The papers collectively show the need to manage three distinct components of quantization error:
    • Bias (solved by Stochastic Rounding).
    • Variance (controlled by the Random Hadamard Transform).
    • Deterministic error / MSE (minimized by superior formats like NVFP4 and techniques like QuEST). A robust method must address all three; a small numerical demonstration of the bias/variance/MSE distinction follows below.
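
The snippet below quantizes the same tensor repeatedly with round-to-nearest (RtN) and with stochastic rounding (SR) onto the E2M1 grid and estimates the bias, variance, and MSE of each. It is self-contained and illustrative only, not drawn from any of the papers; the per-tensor scale handling is deliberately simplified so the bias behaviour is easy to see.

```python
# Estimate bias, variance, and MSE of round-to-nearest (RtN) vs stochastic rounding (SR)
# on the E2M1 grid. Self-contained and illustrative only.
import torch

E2M1 = torch.tensor([-6., -4., -3., -2., -1.5, -1., -.5, 0., .5, 1., 1.5, 2., 3., 4., 6.])

def rtn(x):
    idx = (x.unsqueeze(-1) - E2M1).abs().argmin(dim=-1)
    return E2M1[idx]

def sr(x):
    hi_idx = torch.bucketize(x, E2M1).clamp(1, len(E2M1) - 1)
    lo, hi = E2M1[hi_idx - 1], E2M1[hi_idx]
    p_hi = ((x - lo) / (hi - lo)).clamp(0, 1)
    return torch.where(torch.rand_like(x) < p_hi, hi, lo)

torch.manual_seed(0)
x = torch.rand(10000) * 11.9 - 5.95              # values inside the representable range
trials = 256

for name, quant in [("RtN", rtn), ("SR ", sr)]:
    samples = torch.stack([quant(x) for _ in range(trials)])   # RtN repeats are identical
    mean_q = samples.mean(dim=0)
    bias = (mean_q - x).abs().mean().item()      # |E[q] - x|, averaged over elements
    var = samples.var(dim=0).mean().item()       # spread of q around its own mean
    mse = (samples - x).pow(2).mean().item()     # per element, roughly bias^2 + variance
    print(f"{name} mean|bias|={bias:.4f}  mean variance={var:.4f}  MSE={mse:.4f}")
```

Note that SR has a higher per-element MSE than RtN; the reason the papers still prefer it for gradients is that its errors average out to zero across many steps instead of accumulating in one direction.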

kooshi avatar Jul 02 '25 00:07 kooshi

It depends what you mean by FP4 training. We do not yet plan to support full FP4 training (as in, both forward and backward) since there is no evidence yet that it would converge for large models/long token horizons. We do plan to support FP4 forward/mxFP8 backward though, with the main use case being fine-tuning for FP4 inference. This support is our next goal after finishing the DeepSeek fp8 recipe.

I recommend checking out these methods. If they could be adapted to make FP4 training viable, I think that would be great.

hg0428 avatar Jul 10 '25 14:07 hg0428

According to this article it is stable: https://developer.nvidia.com/blog/nvfp4-trains-with-precision-of-16-bit-and-speed-and-efficiency-of-4-bit/

Is it worth adding now?

yash3056 avatar Aug 30 '25 12:08 yash3056

It's already proven. GPT-OSS was trained in FP4. We don't know exactly what algorithm they used, but we have many great candidates in these papers.

kooshi avatar Aug 30 '25 13:08 kooshi

@kooshi To be exact, they used MXFP4, and according to NVIDIA, NVFP4 is more stable than MXFP4, so I think it is time for it to be added.

yash3056 avatar Aug 30 '25 16:08 yash3056

It depends what you mean by FP4 training. We do not yet plan to support full FP4 training (as in, both forward and backward) since there is no evidence yet that it would converge for large models/long token horizons. We do plan to support FP4 forward/mxFP8 backward though, with the main use case being fine-tuning for FP4 inference. This support is our next goal after finishing the DeepSeek fp8 recipe.

Hi, the NVFP4 training recipe has been released. Will FP4 forward/mxFP8 backward still be developed? Either way, what is the reasoning for continuing with it or dropping it? Thanks! :)

lixlbuaa avatar Oct 13 '25 09:10 lixlbuaa

Hi, the NVFP4 training recipe has been released.

Finally! Looks like this issue is resolved by PR #2177

If anyone knows any details of the implementation beyond the code itself, please do share.

Edit: here's the exact paper they implemented: https://arxiv.org/abs/2509.25149v1 (source: https://github.com/NVIDIA/TransformerEngine/blob/main/docs/examples/fp8_primer.ipynb)

kooshi avatar Oct 13 '25 14:10 kooshi

If anyone knows any details of the implementation beyond the code itself, please do share.

Thank you for your reply. The current release uses FP4 for both the forward and backward passes. However, back in April the plan was to support FP4 forward/mxFP8 backward rather than full FP4 training. I'd like to know the reason for this change of plans.

lixlbuaa avatar Oct 15 '25 03:10 lixlbuaa