
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization i...

Results: 414 TransformerEngine issues

**Is your feature request related to a problem? Please describe.** At the moment, we enumerate the parameters in C APIs like this:
https://github.com/NVIDIA/TransformerEngine/blob/5e4e0b2c378d2b1ec2ee65dfa85124e1dd805389/transformer_engine/common/fused_attn/fused_attn.cpp#L835
As we add more features to attention, ...

refactor
attention
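
For context, a common pattern for keeping such a C API extensible (a hypothetical sketch of the general technique, not TransformerEngine's actual interface; all names below are invented) is to replace the enumerated arguments with a size-tagged parameter struct:

```cpp
// Hypothetical sketch: a versioned parameter struct keeps the C ABI stable
// as new attention features are added, instead of growing the argument list.
#include <stddef.h>
#include <stdint.h>

typedef struct {
  size_t struct_size;  // caller sets this to sizeof(AttnParams); the library
                       // uses it to detect which fields the caller knows about
  int64_t batch;       // batch size
  int64_t num_heads;   // number of attention heads
  int64_t head_dim;    // per-head hidden dimension
  float   dropout_p;   // attention dropout probability
  int     is_causal;   // nonzero for causal masking
  // New fields are appended here; older callers simply never set them.
} AttnParams;

// Instead of fused_attn_fwd(batch, num_heads, head_dim, dropout_p, causal, ...)
int fused_attn_fwd(const AttnParams* params);
```

The `struct_size` tag lets new attention features be appended without breaking existing callers, which is the usual motivation for this style of refactor.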

**Is your feature request related to a problem? Please describe.** This is not related to a problem; it is a feature request to expand model coverage. **Describe the solution you'd...

attention

**Describe the bug** Hi, are there any plans to publish prebuilt wheels? Right now, during `pip install`, the pybind modules are built via CMake in a brittle manner (accessing...

build

**Is your feature request related to a problem? Please describe.** To be added. **Describe the solution you'd like** Work on improving performance for FP8 current scaling. **Describe alternatives you've considered**...

performance
attention
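
For background, "current scaling" derives the FP8 scale from the amax of the tensor being quantized right now, rather than from an amax history as in delayed scaling; the extra reduction pass over the live tensor is the main performance cost. A minimal sketch of the idea (not TE's implementation):

```cpp
// Minimal sketch of FP8 current scaling: compute the scale from the amax of
// the tensor being quantized, immediately before quantization.
#include <algorithm>
#include <cmath>
#include <vector>

constexpr float kFp8E4M3Max = 448.0f;  // largest finite value in FP8 E4M3

float compute_current_scale(const std::vector<float>& x) {
  float amax = 0.0f;
  for (float v : x) amax = std::max(amax, std::fabs(v));
  // Map the observed dynamic range onto the FP8 representable range.
  return amax > 0.0f ? kFp8E4M3Max / amax : 1.0f;
}
// Quantization then rounds x[i] * scale into FP8; the per-tensor reduction
// above is the extra work that current scaling adds over delayed scaling.
```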

**Is your feature request related to a problem? Please describe.** The logic around cuDNN's support matrix for SDPA is getting long and hard to maintain. **Describe the solution you'd like**...

refactor
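
One standard refactor for this kind of code (a hypothetical sketch; the fields and rules below are invented, not the solution the issue proposes) is to make the support matrix data-driven instead of a chain of nested conditionals:

```cpp
// Hypothetical sketch of a data-driven support matrix: each rule records the
// constraints under which a configuration is supported, so adding a new cuDNN
// version or feature means appending a rule rather than editing if/else logic.
#include <cstdint>
#include <vector>

struct SdpaConfig {
  int64_t head_dim;
  bool    is_causal;
  int     cudnn_version;  // e.g. 90100 for cuDNN 9.1.0
};

struct SupportRule {
  int     min_cudnn_version;
  int64_t max_head_dim;
  bool    requires_causal;
};

bool is_supported(const SdpaConfig& cfg, const std::vector<SupportRule>& rules) {
  for (const auto& r : rules) {
    if (cfg.cudnn_version >= r.min_cudnn_version &&
        cfg.head_dim <= r.max_head_dim &&
        (!r.requires_causal || cfg.is_causal)) {
      return true;  // first matching rule wins
    }
  }
  return false;
}
```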

**Describe the bug** A clear and concise description of what the bug is. **Steps/Code to reproduce bug** Please list *minimal* steps or code snippet for us to be able to...

bug
waiting-for-feedback

Hi, I locally compiled the release 2.8 branch. When I tried to use NVFP4 on an RTX 50 series GPU, it gave me this error:
```
/home/aza/workspace/projects/nvfp4/TransformerEngine/transformer_engine/common/util/nvfp4_transpose.cuh:234 in function mul_cvt_bf16_to_fp4_4x_with_rn (thread (95,0,0), block (2,2,0)): ...
```

# Description

This PR adds a persistent gated MXFP8 kernel optimized for rowwise scaling, SwiGLU activation (FWD and BWD), and BF16/FP16 input tensors. The kernel uses the "Cluster Launch Control"...
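
For reference, the gated SwiGLU activation being fused is the standard definition, assuming the usual gate/up split of the input (the MXFP8 quantization and persistent-kernel details are not shown):

$$
y = \mathrm{SwiGLU}(x_g, x_u) = \mathrm{SiLU}(x_g) \odot x_u,
\qquad \mathrm{SiLU}(z) = z\,\sigma(z),
$$

and the backward pass uses

$$
\mathrm{SiLU}'(z) = \sigma(z)\bigl(1 + z\,(1 - \sigma(z))\bigr),
\qquad
\frac{\partial \mathcal{L}}{\partial x_g} = \frac{\partial \mathcal{L}}{\partial y} \odot x_u \odot \mathrm{SiLU}'(x_g),
\qquad
\frac{\partial \mathcal{L}}{\partial x_u} = \frac{\partial \mathcal{L}}{\partial y} \odot \mathrm{SiLU}(x_g).
$$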

Does TransformerEngine have a distributed AdamW optimizer that we can use with DDP?