TransformerEngine
[PyTorch] FP8 Subchannel Recipe With FP8 Gather And Configurable Scaling Factor Tensor Swizzling
Description
Two Goals:
- Support FP8 gather for the FP8 subchannel recipe and remove the forced high-precision gather branch. The benefit is that we no longer have to save the full-precision tensor for backward, which lowers memory usage.
- Keep track of whether an FP8 subchannel-scaled tensor has a compact or non-compact scaling factor tensor. cuBLAS requires a specific swizzling of the scaling factor layout (transpose, padding): when no gather will be performed, the quantization kernel should directly output the swizzled layout; otherwise it should output a compact format (see the layout sketch after this list).
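To make the two layouts concrete, here is a minimal sketch of compact vs. GEMM-ready scaling factors, assuming 1x128 row-wise blocking; the helper name and the padding multiple are illustrative, not the actual TransformerEngine implementation:

```python
import math

import torch


def swizzle_scales_for_gemm(scales_compact: torch.Tensor, pad_to: int = 4) -> torch.Tensor:
    """Illustrative only: turn a compact (rows, num_col_blocks) scale tensor
    into a transposed and padded layout of the kind the GEMM consumes."""
    rows, col_blocks = scales_compact.shape
    # Transpose so the block dimension is outermost...
    swizzled = scales_compact.t().contiguous()
    # ...and pad the (now inner) row dimension up to a multiple of pad_to.
    padded_rows = math.ceil(rows / pad_to) * pad_to
    out = swizzled.new_zeros(col_blocks, padded_rows)
    out[:, :rows] = swizzled
    return out


compact = torch.rand(6, 2)              # e.g. 6 rows, K=256 -> 2 blocks of 128
gemm_ready = swizzle_scales_for_gemm(compact)
print(compact.shape, gemm_ready.shape)  # torch.Size([6, 2]) torch.Size([2, 8])
```

The compact layout is what a coalesced gather wants to move; the swizzled layout is what the GEMM consumes directly.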
What needs to be done:
- A 1x128 quantize kernel that supports three modes: quantize only, quantize-transpose, and quantize non-transpose. Here, quantize non-transpose means using one kernel to do both 1x128 and 128x1 quantization, but skipping the transpose of the 128x1 output so that it can be gathered in backward (see the sketch after this list).
- Enable coalesced gather of the FP8 quantized tensor and its scaling factors.
- Correctly track whether the scaling factor tensor has been transposed or padded, as opposed to being a plain compact tensor.
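Below is a rough sketch of the intended bookkeeping, using hypothetical enum and function names (the real code defines its own types); it only illustrates the decision rule described above:

```python
from enum import Enum, auto


class QuantizeMode(Enum):
    """Hypothetical names for the three kernel modes."""
    QUANTIZE_ONLY = auto()           # 1x128 row-wise output only
    QUANTIZE_TRANSPOSE = auto()      # 1x128 output plus transposed 128x1 output
    QUANTIZE_NON_TRANSPOSE = auto()  # 1x128 and 128x1 outputs, transpose skipped
                                     # so the column-wise data can be gathered later


class ScaleFormat(Enum):
    """Whether a scale tensor is compact or already swizzled for cuBLAS."""
    COMPACT = auto()     # plain (rows, num_blocks) layout, gather-friendly
    GEMM_READY = auto()  # transposed/padded layout consumed directly by the GEMM


def pick_mode(needs_column_wise: bool, will_gather: bool) -> tuple:
    """Choose kernel mode and scale format for one quantization call."""
    if will_gather:
        # Keep everything compact and defer the transpose/swizzle until after the gather.
        mode = QuantizeMode.QUANTIZE_NON_TRANSPOSE if needs_column_wise else QuantizeMode.QUANTIZE_ONLY
        return mode, ScaleFormat.COMPACT
    # No gather: the kernel can emit the GEMM-ready layout directly.
    mode = QuantizeMode.QUANTIZE_TRANSPOSE if needs_column_wise else QuantizeMode.QUANTIZE_ONLY
    return mode, ScaleFormat.GEMM_READY
```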
Type of change
- [ ] Documentation change (change only to the documentation, either a fix or new content)
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Infra/Build change
- [ ] Code refactoring
Changes
Please list the changes introduced in this PR:
- CUDA kernel for 1x128 blockwise quantization with quantize-only, quantize-transpose, and quantize non-transpose modes
- Scaling factor swizzling control (compact vs. GEMM-ready formats)
- FP8 gather of quantized data and their scaling factors
Unit Tests
# quantizer test
pytest tests/pytorch/test_float8_blockwise_scaling_exact.py::test_quantization_1D_block_tiling_with_compact_data_and_scales -s -v
# layer test
pytest tests/pytorch/test_float8_blockwise_scaling_exact.py::TestFP8BlockScalingRecipeLayerNormLinear::test_fp8_current_scaling_with_layernorm_linear_module -s -v
# distributed test with fp8 gather
pytest tests/pytorch/distributed/test_numerics.py -s -v
Checklist:
- [ ] I have read and followed the contributing guidelines
- [ ] The functionality is complete
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes
/te-ci pytorch L1
The dequantize code for the quantized tensor assumes GEMM_READY format. It should at least assert that assumption.
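For example, something along these lines (attribute and format names are placeholders for however the tensor records its scale layout):

```python
def dequantize_blockwise(qtensor, dtype):
    """Sketch of the requested guard; not the actual dequantize implementation."""
    # The dequantize math indexes the scales as if they were already
    # transposed/padded, so refuse to run on compact-format data.
    assert qtensor.scale_format == "GEMM_READY", (
        f"dequantize assumes GEMM_READY scale layout, got {qtensor.scale_format}"
    )
    # ... actual dequantize would follow here ...
```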
I like the new API with the GEMM_READY and COMPACT formats. I'm less sure about whether set_usage should be overridden to also include the new settings.
There are fields on the Quantizer that capture the RowFmt and ColFmt. I think it will also be valuable to have fields on the QuantizedTensor that indicate how the data was quantized. I know the tensor has a handle to the quantizer, but the quantizer's usage can be modified, so having static descriptions of the data would be more reliable.
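Something like this hypothetical snapshot, taken at quantization time (names are illustrative, not the existing API):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuantizedTensorFormatInfo:
    """Static record of how the data was quantized, stored on the tensor itself."""
    rowwise_scale_format: Optional[str]     # e.g. "COMPACT" or "GEMM_READY"
    columnwise_scale_format: Optional[str]  # None if no column-wise data was produced


def snapshot_formats(quantizer) -> QuantizedTensorFormatInfo:
    # Copy the quantizer's current formats into the output tensor, so later
    # changes to the quantizer's usage cannot misdescribe already-quantized data.
    return QuantizedTensorFormatInfo(
        rowwise_scale_format=quantizer.row_fmt,
        columnwise_scale_format=quantizer.col_fmt,
    )
```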
/te-ci pytorch L1
/te-ci pytorch L1
@timmoon10 can you take a look at this PR and check whether it's good to go?
Some issues occurred with the distributed tests after rebasing on top of https://github.com/NVIDIA/TransformerEngine/pull/1814
EDIT: bug fixed.
/te-ci pytorch L1
/te-ci L1
/te-ci L1
/te-ci L1 pytorch
/te-ci pytorch L1
/te-ci pytorch L1
/te-ci pytorch L1