TransformerEngine
Plans for block-wise FP8 quantization during training?
Hi TE team,
I'm interested in whether there are plans to implement block-wise quantization for FP8 training, similar to what's described in papers like the DeepSeek-V3 technical report.
Block-wise quantization could provide better numerical stability and accuracy than tensor-wide quantization, especially in the presence of outlier values. This could be particularly valuable for large language models, where maintaining precision is crucial.
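To make the outlier point concrete, here is a rough toy sketch (not TE code) that simulates FP8 E4M3 round-tripping with a single per-tensor scale versus one scale per 128x128 block (the block size used in DeepSeek-V3). The shapes and the outlier value are made up for illustration, and it assumes a PyTorch version that has `torch.float8_e4m3fn`:

```python
import torch

FP8_MAX = 448.0   # max normal value of FP8 E4M3
BLOCK = 128

def fake_quant(x, scale):
    # Scale, round through FP8 E4M3, then dequantize back to float32.
    return (x * scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn).to(torch.float32) / scale

x = torch.randn(1024, 1024)
x[0, 0] = 1000.0  # a single outlier inflates the per-tensor amax

# Per-tensor scaling: one scale for the whole tensor, dictated by the outlier,
# so small entries get pushed into the FP8 underflow region.
xq_tensor = fake_quant(x, FP8_MAX / x.abs().max())

# Block-wise scaling: one scale per 128x128 block, so only the outlier's own
# block loses resolution.
xb = x.reshape(1024 // BLOCK, BLOCK, 1024 // BLOCK, BLOCK)
scale = FP8_MAX / xb.abs().amax(dim=(1, 3), keepdim=True)
xq_block = fake_quant(xb, scale).reshape(1024, 1024)

print("flushed to zero, per-tensor:", ((xq_tensor == 0) & (x != 0)).float().mean().item())
print("flushed to zero, block-wise:", ((xq_block == 0) & (x != 0)).float().mean().item())
```

With the outlier present, the per-tensor variant flushes a visibly larger fraction of small entries to zero, which is exactly the dynamic-range problem block-wise scales mitigate.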
Some specific questions:
- Is this feature currently on your roadmap?
- If yes, what's the approximate timeline?
- If no, are there technical challenges preventing this implementation?
Thank you for your time!
I have the same interest in block-wise FP8.
ME TOO
In addition, activations use tile-wise (1 x 128) quantization in DeepSeek-V3.
I am curious how they achieve efficient tile-wise (1 x 128) quantization. If you simply iterate over tiles with a Python for loop, the code would be very slow.
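The per-tile amax/scale computation itself doesn't need an explicit Python loop: it can be expressed with a reshape and a reduction, as in the rough sketch below (assumed shapes and a simple E4M3 cast, not DeepSeek's or TE's actual implementation, which fuses this into dedicated CUDA kernels):

```python
import torch

FP8_MAX = 448.0  # max normal value of FP8 E4M3
TILE = 128

def tilewise_quantize(x: torch.Tensor):
    """Quantize a (rows, cols) activation with one scale per 1 x 128 tile."""
    rows, cols = x.shape
    assert cols % TILE == 0, "sketch assumes the last dim is a multiple of 128"
    tiles = x.reshape(rows, cols // TILE, TILE)
    amax = tiles.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
    scale = FP8_MAX / amax                                     # one scale per tile
    xq = (tiles * scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return xq.reshape(rows, cols), scale.squeeze(-1)

x = torch.randn(4096, 8192, device="cuda" if torch.cuda.is_available() else "cpu")
xq, scales = tilewise_quantize(x)   # no per-tile Python loop involved
```

In production the scale computation and the cast are fused with the surrounding GEMM or kept in a single kernel, but even at the PyTorch level the loop-free formulation is what keeps it fast.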
If you have been watching the repository, you may have noticed that a blockwise FP8 recipe will be added to TE soon (it will be included in TE v2.3). If you are interested, you can try PR #1559 in advance.
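For anyone waiting on v2.3, the existing `fp8_autocast` pattern with the current `DelayedScaling` recipe is shown below; presumably the blockwise recipe from the PR slots in at the same place, but the exact recipe class and its arguments should be taken from PR #1559 itself. Requires an FP8-capable GPU.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Existing recipe today; swap in the blockwise recipe once the PR / v2.3 lands.
recipe = DelayedScaling(fp8_format=Format.HYBRID)

layer = te.Linear(4096, 4096).cuda()
x = torch.randn(16, 4096, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    out = layer(x)
out.sum().backward()
```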