TransformerEngine
[PyTorch] FP8 Subchannel Recipe With FP8 Gather And Configurable Scaling Factor Tensor Swizzling
Description
Two Goals:
- Support FP8 gather for the FP8 subchannel recipe and remove the forced high-precision gather branch. The benefit is that we no longer have to save the full-precision tensor for backward, which lowers memory usage.
- Keep track of whether an FP8 subchannel-scaled tensor has a compact or non-compact scaling factor tensor. cuBLAS requires a specific swizzling of the scaling factor layout (transpose, padding): when no gather will be performed, the quantization kernel should directly output the swizzled layout; otherwise it should output a compact format (see the layout sketch after this list).
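To make the two layouts concrete, here is a minimal sketch of compact vs. GEMM-ready scaling factors, assuming 1x128 row-wise blocking; the helper name and the padding multiple are illustrative, not the actual TransformerEngine implementation:

```python
import math

import torch


def swizzle_scales_for_gemm(scales_compact: torch.Tensor, pad_to: int = 4) -> torch.Tensor:
    """Illustrative only: turn a compact (rows, num_col_blocks) scale tensor
    into a transposed and padded layout of the kind the GEMM consumes."""
    rows, col_blocks = scales_compact.shape
    # Transpose so the block dimension is outermost...
    swizzled = scales_compact.t().contiguous()
    # ...and pad the (now inner) row dimension up to a multiple of pad_to.
    padded_rows = math.ceil(rows / pad_to) * pad_to
    out = swizzled.new_zeros(col_blocks, padded_rows)
    out[:, :rows] = swizzled
    return out


compact = torch.rand(6, 2)              # e.g. 6 rows, K=256 -> 2 blocks of 128
gemm_ready = swizzle_scales_for_gemm(compact)
print(compact.shape, gemm_ready.shape)  # torch.Size([6, 2]) torch.Size([2, 8])
```

The compact layout is what a coalesced gather wants to move; the swizzled layout is what the GEMM consumes directly.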
What needs to be done:
- A 1x128 quantize kernel that supports three modes: quantize only, quantize-transpose, and quantize non-transpose. Here, quantize non-transpose means using one kernel to do both 1x128 and 128x1 quantization, but skipping the transpose of the 128x1 output so that it can be gathered in backward (see the sketch after this list).
- Enable coalesced gather of the FP8 quantized tensor and its scaling factors.
- Correctly track whether the scaling factor tensor has been transposed or padded, as opposed to being a plain compact tensor.
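Below is a rough sketch of the intended bookkeeping, using hypothetical enum and function names (the real code defines its own types); it only illustrates the decision rule described above:

```python
from enum import Enum, auto


class QuantizeMode(Enum):
    """Hypothetical names for the three kernel modes."""
    QUANTIZE_ONLY = auto()           # 1x128 row-wise output only
    QUANTIZE_TRANSPOSE = auto()      # 1x128 output plus transposed 128x1 output
    QUANTIZE_NON_TRANSPOSE = auto()  # 1x128 and 128x1 outputs, transpose skipped
                                     # so the column-wise data can be gathered later


class ScaleFormat(Enum):
    """Whether a scale tensor is compact or already swizzled for cuBLAS."""
    COMPACT = auto()     # plain (rows, num_blocks) layout, gather-friendly
    GEMM_READY = auto()  # transposed/padded layout consumed directly by the GEMM


def pick_mode(needs_column_wise: bool, will_gather: bool) -> tuple:
    """Choose kernel mode and scale format for one quantization call."""
    if will_gather:
        # Keep everything compact and defer the transpose/swizzle until after the gather.
        mode = QuantizeMode.QUANTIZE_NON_TRANSPOSE if needs_column_wise else QuantizeMode.QUANTIZE_ONLY
        return mode, ScaleFormat.COMPACT
    # No gather: the kernel can emit the GEMM-ready layout directly.
    mode = QuantizeMode.QUANTIZE_TRANSPOSE if needs_column_wise else QuantizeMode.QUANTIZE_ONLY
    return mode, ScaleFormat.GEMM_READY
```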
Type of change
- [ ] Documentation change (change only to the documentation, either a fix or new content)
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Infra/Build change
- [ ] Code refactoring
Changes
Please list the changes introduced in this PR:
- CUDA kernel for 1x128 blockwise quantization with quantize-only, quantize-transpose, and quantize non-transpose modes
- Scaling factor swizzling control (compact vs. GEMM-ready formats)
- FP8 gather of quantized data and their scaling factors
Unit Tests
# quantizer test
pytest tests/pytorch/test_float8_blockwise_scaling_exact.py::test_quantization_1D_block_tiling_with_compact_data_and_scales -s -v
# layer test
pytest tests/pytorch/test_float8_blockwise_scaling_exact.py::TestFP8BlockScalingRecipeLayerNormLinear::test_fp8_current_scaling_with_layernorm_linear_module -s -v
# distributed test with fp8 gather
pytest tests/pytorch/distributed/test_numerics.py -s -v
Checklist:
- [ ] I have read and followed the contributing guidelines
- [ ] The functionality is complete
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes
/te-ci pytorch L1
The dequantize code for the quantized tensor assumes GEMM_READY format. It should at least assert that assumption.
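For example, something along these lines (attribute and format names are placeholders for however the tensor records its scale layout):

```python
def dequantize_blockwise(qtensor, dtype):
    """Sketch of the requested guard; not the actual dequantize implementation."""
    # The dequantize math indexes the scales as if they were already
    # transposed/padded, so refuse to run on compact-format data.
    assert qtensor.scale_format == "GEMM_READY", (
        f"dequantize assumes GEMM_READY scale layout, got {qtensor.scale_format}"
    )
    # ... actual dequantize would follow here ...
```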
I like the new API with the GEMM_READY and COMPACT formats. I'm less sure about whether set_usage should be overridden to also include the new settings.
There are fields on the Quantizer that capture the RowFmt and ColFmt. I think it will also be valuable to have fields on the QuantizedTensor that indicate how the data was quantized. I know the tensor has a handle to the quantizer, but the quantizer's usage can be modified, so having static descriptions of the data would be more reliable.
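Something like this hypothetical snapshot, taken at quantization time (names are illustrative, not the existing API):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuantizedTensorFormatInfo:
    """Static record of how the data was quantized, stored on the tensor itself."""
    rowwise_scale_format: Optional[str]     # e.g. "COMPACT" or "GEMM_READY"
    columnwise_scale_format: Optional[str]  # None if no column-wise data was produced


def snapshot_formats(quantizer) -> QuantizedTensorFormatInfo:
    # Copy the quantizer's current formats into the output tensor, so later
    # changes to the quantizer's usage cannot misdescribe already-quantized data.
    return QuantizedTensorFormatInfo(
        rowwise_scale_format=quantizer.row_fmt,
        columnwise_scale_format=quantizer.col_fmt,
    )
```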
/te-ci pytorch L1
/te-ci pytorch L1
@timmoon10 can you take a look at this PR and check whether it's good to go?
Some issues occurred with the distributed tests after rebasing on top of https://github.com/NVIDIA/TransformerEngine/pull/1814
EDIT: bug fixed.
/te-ci pytorch L1
/te-ci L1
/te-ci L1
/te-ci L1 pytorch
/te-ci pytorch L1
/te-ci pytorch L1
/te-ci pytorch L1