TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
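A minimal usage sketch of the FP8 autocast API, adapted from the project's quickstart; the layer sizes below are arbitrary and only for illustration.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Arbitrary dimensions for illustration.
in_features, out_features, batch = 768, 3072, 4096

model = te.Linear(in_features, out_features, bias=True)
inp = torch.randn(batch, in_features, device="cuda")

# Delayed-scaling FP8 recipe; all arguments are optional.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

# Run the forward pass with FP8 autocasting enabled.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(inp)

out.sum().backward()
```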
# Description The functionality is ready, but we're not seeing a perf gain due to a performance regression in the fused activation and quantization kernels; take an input of shape (8*4000, 4096)...
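A hedged micro-benchmark sketch for that input shape: timing an FP8 `LayerNormMLP` forward pass with CUDA events is one way to check whether the fused activation + quantization path has regressed. The module choice, hidden sizes, and iteration counts are assumptions, not the PR's actual benchmark.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

hidden, ffn_hidden = 4096, 16384  # assumed sizes
x = torch.randn(8 * 4000, hidden, device="cuda", dtype=torch.bfloat16)
mlp = te.LayerNormMLP(hidden, ffn_hidden, params_dtype=torch.bfloat16).cuda()
fp8_recipe = recipe.DelayedScaling()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    for _ in range(10):  # warm-up iterations
        mlp(x)
    torch.cuda.synchronize()
    start.record()
    for _ in range(100):
        mlp(x)
    end.record()
    torch.cuda.synchronize()
print(f"avg forward time: {start.elapsed_time(end) / 100:.3f} ms")
```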
**Is your feature request related to a problem? Please describe.** `ulysses SP + ring attention` gives good performance in SFT/RL training, which is called `hierarchical CP` here. But it...
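A conceptual sketch of the two-level process-group layout behind hierarchical CP (an inner Ulysses-style all-to-all group plus an outer ring group); this is an assumed illustration with plain `torch.distributed` groups, not TE's or Megatron's actual implementation.

```python
import torch.distributed as dist

def build_hierarchical_cp_groups(cp_size: int, a2a_size: int):
    """Split a cp_size-way context-parallel domain into inner (all-to-all) x outer (ring) groups.

    Assumes dist.init_process_group() has already been called on every rank.
    """
    assert cp_size % a2a_size == 0
    world, rank = dist.get_world_size(), dist.get_rank()
    a2a_group, ring_group = None, None
    for start in range(0, world, cp_size):
        cp_ranks = list(range(start, start + cp_size))
        # Inner groups: contiguous ranks that run Ulysses-style all-to-all over heads.
        for i in range(0, cp_size, a2a_size):
            ranks = cp_ranks[i:i + a2a_size]
            g = dist.new_group(ranks)
            if rank in ranks:
                a2a_group = g
        # Outer groups: strided ranks that exchange KV chunks peer-to-peer in a ring.
        for i in range(a2a_size):
            ranks = cp_ranks[i::a2a_size]
            g = dist.new_group(ranks)
            if rank in ranks:
                ring_group = g
    return a2a_group, ring_group
```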
Hi @taesiri 🤗 I'm Niels and I work as part of the open-source team at Hugging Face. I discovered your work through Hugging Face's daily papers, as yours was featured: https://huggingface.co/papers/2509.25149....
Refactors the test_checkpoint.py test suite to be a bit more pytest-native and removes the need to pre-generate checkpoint files. Also adds some (currently failing) torch.dcp and huggingface checkpoint tests.
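The refactoring pattern might look like the following pytest sketch, where the checkpoint is generated on the fly via the `tmp_path` fixture instead of being shipped as a pre-built file; the fixture and test names are illustrative, not the actual suite.

```python
import pytest
import torch

@pytest.fixture
def checkpoint_path(tmp_path):
    """Generate a small checkpoint at test time instead of pre-generating files."""
    model = torch.nn.Linear(16, 16)
    path = tmp_path / "model.pt"
    torch.save(model.state_dict(), path)
    return path

def test_checkpoint_roundtrip(checkpoint_path):
    model = torch.nn.Linear(16, 16)
    model.load_state_dict(torch.load(checkpoint_path))
```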
Adds some currently failing Hugging Face tests around safetensors and `quantized_model_init`
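A sketch of what a currently-failing safetensors test could look like, marked `xfail` until the gap is fixed; the reason string and tensor contents are placeholders, not the PR's actual tests.

```python
import pytest
import torch
from safetensors.torch import save_file, load_file

@pytest.mark.xfail(reason="placeholder: known gap in the safetensors round-trip")
def test_safetensors_roundtrip(tmp_path):
    state = {"weight": torch.randn(8, 8)}
    path = str(tmp_path / "model.safetensors")
    save_file(state, path)
    loaded = load_file(path)
    torch.testing.assert_close(loaded["weight"], state["weight"])
```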
# Description Fixes a bug that causes precision issues in mixed-precision training. The current implementation of the copy_ method in the QuantizedTensor class does not properly pass the dst.dtype information when src is...
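An illustrative sketch of the failure mode being fixed; this is not TE's QuantizedTensor code, just a minimal example of a copy helper that must cast to the destination dtype rather than inheriting the source's.

```python
import torch

def copy_into(dst: torch.Tensor, src_quantized: torch.Tensor, src_scale: torch.Tensor):
    """Copy a (hypothetical) quantized source into dst while respecting dst.dtype."""
    dequant = src_quantized.to(torch.float32) * src_scale  # dequantize
    # Buggy pattern: copying dequant in the source's working dtype drops dst.dtype.
    # Correct pattern: cast explicitly to dst.dtype before the copy.
    dst.copy_(dequant.to(dst.dtype))
    return dst
```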
# Description In Megatron-Core + Transformer Engine (TE), we quantize activations to FP8 before the MoE up-projection and then run the dispatch. This is compatible with TE’s FP8 fprop for...
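A schematic of the ordering being described (quantize, then dispatch, then the FP8 up-projection); every function name here is a placeholder, not a Megatron-Core or TE API.

```python
# Schematic only; router/dispatch/experts are placeholders, not real APIs.
def moe_forward(hidden_states, router, quantize_fp8, dispatch, experts):
    probs, routing_map = router(hidden_states)
    # Quantize activations to FP8 *before* token dispatch so the all-to-all moves
    # 1-byte payloads and the experts' fprop can consume FP8 directly.
    hidden_fp8 = quantize_fp8(hidden_states)
    dispatched = dispatch(hidden_fp8, routing_map)
    expert_out = experts(dispatched)  # FP8 fprop in the MoE up-projection
    return expert_out, probs
```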
# Description This PR adapts TE for activation offloading (a new feature in Megatron-LM, https://github.com/NVIDIA/Megatron-LM/pull/1752). Activation offloading selects the inputs of specific modules (such as `core_attn`, `qkv_linear`, `router_fc1`), offloading...
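A rough illustration of per-module activation offloading using PyTorch's saved-tensor hooks; this is not the Megatron-LM implementation, just the idea that only selected modules stash their saved activations in pinned host memory.

```python
import torch
from torch.autograd.graph import save_on_cpu

class OffloadWrapper(torch.nn.Module):
    """Offload the wrapped module's saved activations to CPU during forward."""

    def __init__(self, module: torch.nn.Module):
        super().__init__()
        self.module = module

    def forward(self, *args, **kwargs):
        # save_on_cpu moves tensors saved for backward to pinned host memory
        # and copies them back when the backward pass needs them.
        with save_on_cpu(pin_memory=True):
            return self.module(*args, **kwargs)

# e.g. wrap only the attention block, analogous to selecting `core_attn`:
# block.attn = OffloadWrapper(block.attn)
```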
# Description Fix assertion error message formatting in DotProductAttention ## Type of change - [ ] Documentation change (change only to the documentation, either a fix or new content)...
Hello, I am trying to run the latest NVIDIA Cosmos model on an RTX 4090 and I get an error when fused attention is called: line 1080 in fused_attn.py...
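If it helps to isolate the failure, one common workaround (assuming the fused-attention backend is the culprit on this GPU) is to disable it via the `NVTE_FUSED_ATTN` environment variable before TE is imported, so attention falls back to another backend; whether that avoids the error here is an assumption.

```python
import os

# Disable the fused attention backend; must be set before transformer_engine is imported.
os.environ["NVTE_FUSED_ATTN"] = "0"

import transformer_engine.pytorch as te  # noqa: E402
```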