TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization i...
# Description This PR creates the following folder:
```
TransformerEngine/examples/pytorch/transformer:
├── context_parallel_runner_bshd.py
├── context_parallel_runner_thd.py
├── model.py
├── __pycache__
├── README.md
├── run_context_parallel.sh
├── test_context_parallel_bshd.py
├── test_context_parallel_thd.py
└── utils.py
```
That...
# Description This PR enables persistence of the MXFP8 cast kernel using the WorkID Query feature on Blackwell (sm100a). Fixes # (issue) ## Type of change - [ ] Documentation change...
# Description This PR introduces support for CP + THD + chunked attention Fixes # (issue) ## Type of change - [ ] Documentation change (change only to the documentation,...
# Description Support exchanging the latent KV (instead of the complete KV) between CP ranks in MLA ring attention. Fixes # (issue) ## Type of change - [ ] Documentation change (change only...
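To illustrate why exchanging the latent is attractive, here is a back-of-the-envelope sketch of per-step ring-attention communication volume. All names and shapes are illustrative assumptions (the 128 heads of dim 128 and latent dim 512 echo published MLA configurations, not necessarily what this PR targets):

```python
def ring_exchange_bytes(seq_len, elem_bytes, *, num_heads, head_dim, latent_dim):
    """Per-step communication volume in ring attention (illustrative sketch).

    full_kv: exchanging complete K and V tensors — one [head_dim] vector per
             head per token, for both K and V.
    latent:  exchanging only the compressed MLA latent — one [latent_dim]
             vector per token, shared across all heads (ignoring the small
             decoupled RoPE key part for simplicity).
    """
    full_kv = seq_len * num_heads * head_dim * 2 * elem_bytes
    latent = seq_len * latent_dim * elem_bytes
    return full_kv, latent

# With the illustrative shapes above, per exchanged token (elem_bytes=1):
full, lat = ring_exchange_bytes(1, 1, num_heads=128, head_dim=128, latent_dim=512)
# full == 32768, lat == 512 — the latent is 64x smaller than the full KV.
```

The trade-off is that each rank must up-project the received latent back to per-head K/V before computing attention, spending compute to save bandwidth.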
# Description This is a small refactor of the library-loading logic at runtime, making it more consistent and avoiding duplication. The main point is to check python packages as a...
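The "check Python packages first" idea can be sketched as follows. This is a hedged illustration only: the function name `find_shared_lib` and the exact search order are assumptions, not TransformerEngine's actual code:

```python
# Illustrative sketch: prefer a shared library shipped inside an installed
# Python package, then fall back to the dynamic loader's normal search.
import importlib.util
from pathlib import Path

def find_shared_lib(pkg_name: str, lib_name: str) -> str:
    """Return a path to lib_name inside pkg_name if that package is installed
    and ships the file; otherwise return the bare name so the dynamic loader
    searches the usual paths (LD_LIBRARY_PATH, system directories)."""
    spec = importlib.util.find_spec(pkg_name)
    if spec is not None and spec.origin is not None:
        candidate = Path(spec.origin).parent / lib_name
        if candidate.exists():
            return str(candidate)
    return lib_name
```

Centralizing a helper like this avoids duplicating the package-vs-system lookup at every load site, which matches the PR's stated goal of consistency.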
# Description This PR adds support for two NVFP4 statistics: underflows and MSE. They are added as a separate feature, because many more NVFP4-specific features may be added later....
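Conceptually, the two statistics can be sketched in pure Python. The exact definitions in TE's precision debug tools may differ; the function name and the underflow definition below are assumptions:

```python
def quantization_stats(original, quantized):
    """Sketch of the two statistics named in the PR:
    - underflow rate: fraction of nonzero inputs that quantize to exactly 0
      (values too small for the narrow NVFP4 dynamic range)
    - mse: mean squared error between original and dequantized values
    """
    nonzero = sum(1 for x in original if x != 0)
    underflows = sum(1 for x, q in zip(original, quantized) if x != 0 and q == 0)
    mse = sum((x - q) ** 2 for x, q in zip(original, quantized)) / len(original)
    return underflows / max(nonzero, 1), mse
```

For example, if a small value like `0.001` rounds to `0` under NVFP4 while larger values survive, it counts toward the underflow rate and contributes its squared magnitude to the MSE.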
# Description Motivation: https://github.com/NVIDIA/TransformerEngine/issues/2053 Fixes # (issue) ## Type of change - [ ] Documentation change (change only to the documentation, either a fix or a new content) - [...
I've found that the latest Docker images (and presumably this repo more broadly) do not support the RTX Pro 6000 (SM120) for MXFP8 (see the error below). I've been unable to find any...
# Description This PR adds a short custom-feature tutorial to the precision debug tools docs. ## Type of change - [x] Documentation change (change only to the documentation, either a fix...
# Description Reworks PDL (Programmatic Dependent Launch) for quantization from #2001 and #2066. Adds two quantization configs - `pdl_sync`: adds `cudaGridDependencySynchronize` to the first quantization kernel, to make sure the previous unknown kernel...