TransformerEngine
Plans for block-wise FP8 quantization during training?
Hi TE team,
I'm interested in whether there are plans to implement block-wise quantization for FP8 training, similar to what's described in papers like the DeepSeek-V3 technical report.
Block-wise quantization could provide better numerical stability and accuracy than tensor-wide quantization, especially in the presence of outlier values. This could be particularly valuable for large language models, where maintaining precision is crucial.
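To make the outlier point concrete, here is a rough toy sketch (not TE code) that simulates FP8 E4M3 round-tripping with a single per-tensor scale versus one scale per 128x128 block (the block size used in DeepSeek-V3). The shapes and the outlier value are made up for illustration, and it assumes a PyTorch version that has `torch.float8_e4m3fn`:

```python
import torch

FP8_MAX = 448.0   # max normal value of FP8 E4M3
BLOCK = 128

def fake_quant(x, scale):
    # Scale, round through FP8 E4M3, then dequantize back to float32.
    return (x * scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn).to(torch.float32) / scale

x = torch.randn(1024, 1024)
x[0, 0] = 1000.0  # a single outlier inflates the per-tensor amax

# Per-tensor scaling: one scale for the whole tensor, dictated by the outlier,
# so small entries get pushed into the FP8 underflow region.
xq_tensor = fake_quant(x, FP8_MAX / x.abs().max())

# Block-wise scaling: one scale per 128x128 block, so only the outlier's own
# block loses resolution.
xb = x.reshape(1024 // BLOCK, BLOCK, 1024 // BLOCK, BLOCK)
scale = FP8_MAX / xb.abs().amax(dim=(1, 3), keepdim=True)
xq_block = fake_quant(xb, scale).reshape(1024, 1024)

print("flushed to zero, per-tensor:", ((xq_tensor == 0) & (x != 0)).float().mean().item())
print("flushed to zero, block-wise:", ((xq_block == 0) & (x != 0)).float().mean().item())
```

With the outlier present, the per-tensor variant flushes a visibly larger fraction of small entries to zero, which is exactly the dynamic-range problem block-wise scales mitigate.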
Some specific questions:
- Is this feature currently on your roadmap?
- If yes, what's the approximate timeline?
- If no, are there technical challenges preventing this implementation?
Thank you for your time!
I have the same interest in block-wise FP8.
ME TOO
In addition, activations use tile-wise (1 x 128) quantization in DeepSeek-V3.
I am curious how they achieve efficient tile-wise (1 x 128) quantization. If you simply iterate over tiles with a Python for loop, the code would be very slow.
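The per-tile amax/scale computation itself doesn't need an explicit Python loop: it can be expressed with a reshape and a reduction, as in the rough sketch below (assumed shapes and a simple E4M3 cast, not DeepSeek's or TE's actual implementation, which fuses this into dedicated CUDA kernels):

```python
import torch

FP8_MAX = 448.0  # max normal value of FP8 E4M3
TILE = 128

def tilewise_quantize(x: torch.Tensor):
    """Quantize a (rows, cols) activation with one scale per 1 x 128 tile."""
    rows, cols = x.shape
    assert cols % TILE == 0, "sketch assumes the last dim is a multiple of 128"
    tiles = x.reshape(rows, cols // TILE, TILE)
    amax = tiles.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
    scale = FP8_MAX / amax                                     # one scale per tile
    xq = (tiles * scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return xq.reshape(rows, cols), scale.squeeze(-1)

x = torch.randn(4096, 8192, device="cuda" if torch.cuda.is_available() else "cpu")
xq, scales = tilewise_quantize(x)   # no per-tile Python loop involved
```

In production the scale computation and the cast are fused with the surrounding GEMM or kept in a single kernel, but even at the PyTorch level the loop-free formulation is what keeps it fast.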
If you have been watching the repository, you may have noticed that a blockwise FP8 recipe will be added to TE soon (it will be included in TE v2.3). If you are interested, you can try PR #1559 in advance.
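For anyone waiting on v2.3, the existing `fp8_autocast` pattern with the current `DelayedScaling` recipe is shown below; presumably the blockwise recipe from the PR slots in at the same place, but the exact recipe class and its arguments should be taken from PR #1559 itself. Requires an FP8-capable GPU.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Existing recipe today; swap in the blockwise recipe once the PR / v2.3 lands.
recipe = DelayedScaling(fp8_format=Format.HYBRID)

layer = te.Linear(4096, 4096).cuda()
x = torch.randn(16, 4096, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    out = layer(x)
out.sum().backward()
```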