
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.

Results: 414 TransformerEngine issues, sorted by recently updated

# Description Currently precision debug tools are not supported for FP8 model parameters. This is because all of the debug-tool logic lives inside the quantize() function in DebugQuantizers, which are not...
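For context, a minimal hypothetical sketch of the pattern this issue describes; the names `inner` and `inspect` are invented for illustration and do not reflect TE's actual DebugQuantizer internals:

```python
# Hypothetical sketch (invented names): all debug logic fires inside
# quantize(), so parameters that are created already in FP8, and therefore
# never pass through quantize(), stay invisible to the debug tools.
class DebugQuantizer:
    def __init__(self, inner, inspect):
        self.inner = inner      # the real quantizer
        self.inspect = inspect  # debug callback, e.g. logs tensor stats

    def quantize(self, tensor):
        self.inspect(tensor)               # debug hook runs only here
        return self.inner.quantize(tensor)
```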

Many thanks for the great work! In the paper https://arxiv.org/pdf/2502.20853 they successfully use 1D weight quantization with requantization, and in their repo https://github.com/thu-ml/TetraJet-MXFP4Training/issues/2#issuecomment-3454394125 the author mentioned from their...
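As a rough illustration of the technique (not TetraJet's actual code; the block size and bit-width here are arbitrary), 1D block quantization followed by requantization might look like:

```python
import numpy as np

def quantize_1d(w, block=32, qmax=7):  # qmax=7 ~ a signed 4-bit grid
    """Quantize along one axis in fixed-size 1D blocks with per-block scales."""
    blocks = w.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / qmax + 1e-12
    q = np.clip(np.round(blocks / scale), -qmax, qmax)
    return q, scale

w = np.random.randn(4, 64).astype(np.float32)
q, s = quantize_1d(w)
w_hat = (q * s).reshape(w.shape)  # dequantize
q2, s2 = quantize_1d(w_hat)       # requantization: fresh scales from w_hat
```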

**Is your feature request related to a problem? Please describe.** I'm always frustrated when I am trying to compile...

I'm trying to use `softmax_type='learnable'` with the FusedAttention backend in Transformer Engine, but the system automatically falls back to UnfusedDotProductAttention even when FusedAttention is explicitly enabled. TE version 2.8.0, torch...
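A repro-style sketch of the setup this report describes; `softmax_type="learnable"` is taken from the issue itself, and whether FusedAttention accepts it is version-dependent, so treat the parameter as an assumption to verify:

```python
import os
import torch
import transformer_engine.pytorch as te

os.environ["NVTE_FUSED_ATTN"] = "1"   # request the fused backend
os.environ["NVTE_DEBUG"] = "1"        # have TE log why a backend is rejected
os.environ["NVTE_DEBUG_LEVEL"] = "2"

attn = te.DotProductAttention(
    num_attention_heads=16,
    kv_channels=64,
    softmax_type="learnable",  # parameter name from the issue report
)
# Default qkv_format is "sbhd": (seq, batch, heads, head_dim).
q = k = v = torch.randn(128, 2, 16, 64, device="cuda", dtype=torch.bfloat16)
out = attn(q, k, v)  # the debug log shows which backend was actually selected
```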

I've read the issue (https://github.com/NVIDIA/TransformerEngine/issues/1409) about the usage of cu_seqlens_q, and I think I understand how cu_seqlens_q is used. However, I'm confused about why cu_seqlens_q[-1] = cu_seqlens_q[-2] in the construction of...
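To make the question concrete: cu_seqlens is the cumulative sum of sequence lengths used by varlen/packed attention, and a zero-length trailing sequence (a common way to account for padding; this reading of the intent is a guess) makes the last two entries equal:

```python
import torch

seqlens = torch.tensor([5, 3, 4, 0])  # last "sequence" has zero length
cu_seqlens_q = torch.zeros(len(seqlens) + 1, dtype=torch.int32)
cu_seqlens_q[1:] = torch.cumsum(seqlens, dim=0)
print(cu_seqlens_q)  # tensor([ 0,  5,  8, 12, 12], dtype=torch.int32)
# cu_seqlens_q[-1] == cu_seqlens_q[-2]: the final interval is empty padding.
```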

# Description This PR fixes some hacky logic in the C++ `Tensor` class: - Construct uninitialized tensors with `shape=[0]`. Previously we constructed them as 0-D tensors, which should have one...
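The distinction being fixed, illustrated with NumPy's shape semantics (the same shape algebra the C++ `Tensor` class presumably follows):

```python
import numpy as np

scalar = np.empty(())    # 0-D tensor: shape (), exactly one element
empty = np.empty((0,))   # 1-D tensor: shape (0,), zero elements
print(scalar.ndim, scalar.size)  # 0 1  -> a 0-D tensor is NOT empty
print(empty.ndim, empty.size)    # 1 0  -> shape=[0] correctly encodes "empty"
```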

# Description This PR adds A2A CP support for JAX. (A toy sketch of the underlying all-to-all collective follows below.)
```
Before
================================================================================
TEST RUNTIME SUMMARY (grouped by function)
================================================================================
test                             | 12x | 1.97s | avg: 0.16s
test_autocast_with_mesh_resource | ...
```

Milestone: 2.10.0
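For readers unfamiliar with A2A context parallelism: the core collective is an all-to-all exchange along the context-parallel device axis. A toy single-host sketch of that primitive (not this PR's implementation):

```python
import jax
import jax.numpy as jnp

def a2a(x):
    # Scatter chunks of the sequence axis across the "cp" devices while
    # gathering the chunks they send back; this is the exchange that A2A
    # context parallelism builds on.
    return jax.lax.all_to_all(x, axis_name="cp", split_axis=0, concat_axis=0)

n = jax.local_device_count()
x = jnp.arange(n * n * 2.0).reshape(n, n * 2)  # one row of data per device
print(jax.pmap(a2a, axis_name="cp")(x))
```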

**Is your feature request related to a problem? Please describe.** N/A **Describe the solution you'd like** Support the DeepSeek FP8 recipe in JAX; it is already supported in PyTorch (see the PyTorch sketch below). **Describe alternatives you've considered**...

Labels: FP8, Priority = P1
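Since the request notes this is already supported in PyTorch, a hedged sketch of what the PyTorch-side usage might look like; `Float8BlockScaling` is my assumption for the DeepSeek-style blockwise recipe class and should be checked against your TE version:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.Float8BlockScaling()  # assumed DeepSeek-style blockwise recipe
layer = te.Linear(1024, 1024).cuda()
x = torch.randn(16, 1024, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)  # GEMMs run with blockwise-scaled FP8 inputs
```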

# Description This pull request adds efficient implementations for MXFP8 quantize in `casting only` cases, improving casting performance by 5%–20% (a toy sketch of the block-scaled cast follows below). It supports: + `BF16` or `FP16`...

Labels: community-contribution
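As a rough sketch of what a "casting only" MXFP8 quantize computes (toy NumPy version, not the PR's CUDA kernels): each 32-element block shares one power-of-two scale derived from the block amax, and the scaled values are then cast to the FP8 grid:

```python
import numpy as np

BLOCK = 32
E4M3_MAX = 448.0  # largest finite FP8 E4M3 value

def mxfp8_cast(x):
    blocks = x.reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # Power-of-two per-block scale, as in the MX formats.
    scale = 2.0 ** np.floor(np.log2(E4M3_MAX / np.maximum(amax, 1e-30)))
    scaled = np.clip(blocks * scale, -E4M3_MAX, E4M3_MAX)
    return scaled, scale  # 'scaled' would then be rounded/bit-cast to FP8

x = np.random.randn(8, 64).astype(np.float32)
scaled, scales = mxfp8_cast(x)
```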