TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization i...

Results 414 TransformerEngine issues

# Description This PR mainly adds the partial cast feature for mxfp8 primary weights. In FSDP, since each forward and backward pass requires gathering params, it's better to only gather...

I have tried nvfp4 training which converges on sm120, but fp8blockscaled recipe won't converge for any of its available options. Is it because of power of 2 scale (cannot be...

I can implement NVFP4-supported linear layer calls with a simple script, but when I use Megatron-LM for NVFP4 training, I found that TE lacks support for NVFP4Tensors in the...

# Description Fixes crashes in binaries linked with both libtorch and libtransformer_engine when running under `nsys profile`. The crash was caused by the wrong libcudnn.so being loaded when a system package like `libcudnn9-cuda-12` is...

# Description FSDP2 Allgather Perf improvement and support for FusedAdam with FSDP2 Fixes # (issue) ## Type of change - [ ] Documentation change (change only to the documentation, either...

# Description The fused cross entropy kernel in Transformer Engine uses 16-bit floating point (BF16) for the backward pass when the input is in BF16, whereas Megatron's VocabParallelCrossEntropy performs its...

1. Fused `moe_permute_with_probs` + `Fp8Padding` and fused `moe_unpermute` + `Fp8Unpadding`, which removes the explicit padding/unpadding in the MoE experts module, improving performance and reducing peak GPU memory usage. 2. Added tests of...

community-contribution

# Description Based on https://github.com/NVIDIA/TransformerEngine/pull/1948. Fixes the CUDA graph order of backward_dw graphs when `delay_wgrad_compute` is enabled. The user may delay the wgrad compute to the end of overlapped forward layers,...

# Description I want to be able to control num splits in FA3. This exposes this argument for non-context-parallel cases. ## Type of change - [ ] Documentation change (change...

# Description Please include a brief summary of the changes, relevant motivation and context. Fixes # (issue) ## Type of change - [ ] Documentation change (change only to the...