
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.

Results: 414 TransformerEngine issues, sorted by recently updated

# Description Currently precision debug tools are not supported for FP8 model parameters. This is because all of the debug-tool logic lives inside the quantize() function in DebugQuantizers, which are not...
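For context, a minimal hypothetical sketch of the pattern this issue describes; the names `inner` and `inspect` are invented for illustration and do not reflect TE's actual DebugQuantizer internals:

```python
# Hypothetical sketch (invented names): all debug logic fires inside
# quantize(), so parameters that are created already in FP8, and therefore
# never pass through quantize(), stay invisible to the debug tools.
class DebugQuantizer:
    def __init__(self, inner, inspect):
        self.inner = inner      # the real quantizer
        self.inspect = inspect  # debug callback, e.g. logs tensor stats

    def quantize(self, tensor):
        self.inspect(tensor)               # debug hook runs only here
        return self.inner.quantize(tensor)
```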

Many thanks for the great work! In the paper https://arxiv.org/pdf/2502.20853 they successfully use 1D weight quantization with requantization, and in their repo https://github.com/thu-ml/TetraJet-MXFP4Training/issues/2#issuecomment-3454394125 the author mentioned from their...
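As a rough illustration of the technique (not TetraJet's actual code; the block size and bit-width here are arbitrary), 1D block quantization followed by requantization might look like:

```python
import numpy as np

def quantize_1d(w, block=32, qmax=7):  # qmax=7 ~ a signed 4-bit grid
    """Quantize along one axis in fixed-size 1D blocks with per-block scales."""
    blocks = w.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / qmax + 1e-12
    q = np.clip(np.round(blocks / scale), -qmax, qmax)
    return q, scale

w = np.random.randn(4, 64).astype(np.float32)
q, s = quantize_1d(w)
w_hat = (q * s).reshape(w.shape)  # dequantize
q2, s2 = quantize_1d(w_hat)       # requantization: fresh scales from w_hat
```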

**Is your feature request related to a problem? Please describe.** I'm always frustrated when I am trying to compile...

I'm trying to use `softmax_type='learnable'` with the FusedAttention backend in Transformer Engine, but the system automatically falls back to UnfusedDotProductAttention even when FusedAttention is explicitly enabled. TE version 2.8.0, torch...
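A repro-style sketch of the setup this report describes; `softmax_type="learnable"` is taken from the issue itself, and whether FusedAttention accepts it is version-dependent, so treat the parameter as an assumption to verify:

```python
import os
import torch
import transformer_engine.pytorch as te

os.environ["NVTE_FUSED_ATTN"] = "1"   # request the fused backend
os.environ["NVTE_DEBUG"] = "1"        # have TE log why a backend is rejected
os.environ["NVTE_DEBUG_LEVEL"] = "2"

attn = te.DotProductAttention(
    num_attention_heads=16,
    kv_channels=64,
    softmax_type="learnable",  # parameter name from the issue report
)
# Default qkv_format is "sbhd": (seq, batch, heads, head_dim).
q = k = v = torch.randn(128, 2, 16, 64, device="cuda", dtype=torch.bfloat16)
out = attn(q, k, v)  # the debug log shows which backend was actually selected
```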

I've read the issue (https://github.com/NVIDIA/TransformerEngine/issues/1409) about the usage of cu_seqlens_q, and I think I understand how cu_seqlens_q is used. However, I'm confused about why cu_seqlens_q[-1] = cu_seqlens_q[-2] in the construction of...
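To make the question concrete: cu_seqlens is the cumulative sum of sequence lengths used by varlen/packed attention, and a zero-length trailing sequence (a common way to account for padding; this reading of the intent is a guess) makes the last two entries equal:

```python
import torch

seqlens = torch.tensor([5, 3, 4, 0])  # last "sequence" has zero length
cu_seqlens_q = torch.zeros(len(seqlens) + 1, dtype=torch.int32)
cu_seqlens_q[1:] = torch.cumsum(seqlens, dim=0)
print(cu_seqlens_q)  # tensor([ 0,  5,  8, 12, 12], dtype=torch.int32)
# cu_seqlens_q[-1] == cu_seqlens_q[-2]: the final interval is empty padding.
```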

# Description This PR fixes some hacky logic in the C++ `Tensor` class: - Construct uninitialized tensors with `shape=[0]`. Previously we constructed them as 0-D tensors, which should have one...
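The distinction being fixed, illustrated with NumPy's shape semantics (the same shape algebra the C++ `Tensor` class presumably follows):

```python
import numpy as np

scalar = np.empty(())    # 0-D tensor: shape (), exactly one element
empty = np.empty((0,))   # 1-D tensor: shape (0,), zero elements
print(scalar.ndim, scalar.size)  # 0 1  -> a 0-D tensor is NOT empty
print(empty.ndim, empty.size)    # 1 0  -> shape=[0] correctly encodes "empty"
```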

# Description This PR adds A2A CP support for JAX. (A toy sketch of the underlying all-to-all collective follows below.)
```
Before
================================================================================
TEST RUNTIME SUMMARY (grouped by function)
================================================================================
test                             | 12x | 1.97s | avg: 0.16s
test_autocast_with_mesh_resource | ...
```

Milestone: 2.10.0
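For readers unfamiliar with A2A context parallelism: the core collective is an all-to-all exchange along the context-parallel device axis. A toy single-host sketch of that primitive (not this PR's implementation):

```python
import jax
import jax.numpy as jnp

def a2a(x):
    # Scatter chunks of the sequence axis across the "cp" devices while
    # gathering the chunks they send back; this is the exchange that A2A
    # context parallelism builds on.
    return jax.lax.all_to_all(x, axis_name="cp", split_axis=0, concat_axis=0)

n = jax.local_device_count()
x = jnp.arange(n * n * 2.0).reshape(n, n * 2)  # one row of data per device
print(jax.pmap(a2a, axis_name="cp")(x))
```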

**Is your feature request related to a problem? Please describe.** N/A **Describe the solution you'd like** Support the DeepSeek FP8 recipe in JAX; it is already supported in PyTorch (see the PyTorch sketch below). **Describe alternatives you've considered**...

Labels: FP8, Priority = P1
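Since the request notes this is already supported in PyTorch, a hedged sketch of what the PyTorch-side usage might look like; `Float8BlockScaling` is my assumption for the DeepSeek-style blockwise recipe class and should be checked against your TE version:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.Float8BlockScaling()  # assumed DeepSeek-style blockwise recipe
layer = te.Linear(1024, 1024).cuda()
x = torch.randn(16, 1024, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)  # GEMMs run with blockwise-scaled FP8 inputs
```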

# Description This pull request adds efficient implementations for MXFP8 quantize in `casting only` cases, improving casting performance by 5%–20% (a toy sketch of the block-scaled cast follows below). It supports: + `BF16` or `FP16`...

Labels: community-contribution
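As a rough sketch of what a "casting only" MXFP8 quantize computes (toy NumPy version, not the PR's CUDA kernels): each 32-element block shares one power-of-two scale derived from the block amax, and the scaled values are then cast to the FP8 grid:

```python
import numpy as np

BLOCK = 32
E4M3_MAX = 448.0  # largest finite FP8 E4M3 value

def mxfp8_cast(x):
    blocks = x.reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # Power-of-two per-block scale, as in the MX formats.
    scale = 2.0 ** np.floor(np.log2(E4M3_MAX / np.maximum(amax, 1e-30)))
    scaled = np.clip(blocks * scale, -E4M3_MAX, E4M3_MAX)
    return scaled, scale  # 'scaled' would then be rounded/bit-cast to FP8

x = np.random.randn(8, 64).astype(np.float32)
scaled, scales = mxfp8_cast(x)
```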