TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
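A minimal usage sketch of the FP8 autocast API, adapted from the project's quickstart; the layer sizes below are arbitrary and only for illustration.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Arbitrary dimensions for illustration.
in_features, out_features, batch = 768, 3072, 4096

model = te.Linear(in_features, out_features, bias=True)
inp = torch.randn(batch, in_features, device="cuda")

# Delayed-scaling FP8 recipe; all arguments are optional.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

# Run the forward pass with FP8 autocasting enabled.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(inp)

out.sum().backward()
```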
# Description The functionality is ready, but we're not seeing a perf gain due to a performance regression in the fused activation and quantization kernels; take an input of shape (8*4000, 4096)...
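A hedged micro-benchmark sketch for that input shape: timing an FP8 `LayerNormMLP` forward pass with CUDA events is one way to check whether the fused activation + quantization path has regressed. The module choice, hidden sizes, and iteration counts are assumptions, not the PR's actual benchmark.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

hidden, ffn_hidden = 4096, 16384  # assumed sizes
x = torch.randn(8 * 4000, hidden, device="cuda", dtype=torch.bfloat16)
mlp = te.LayerNormMLP(hidden, ffn_hidden, params_dtype=torch.bfloat16).cuda()
fp8_recipe = recipe.DelayedScaling()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    for _ in range(10):  # warm-up iterations
        mlp(x)
    torch.cuda.synchronize()
    start.record()
    for _ in range(100):
        mlp(x)
    end.record()
    torch.cuda.synchronize()
print(f"avg forward time: {start.elapsed_time(end) / 100:.3f} ms")
```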
**Is your feature request related to a problem? Please describe.** `ulysses SP + ring attention` gives good performance in SFT/RL training, which is called `hierarchical CP` here. But it...
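A conceptual sketch of the two-level process-group layout behind hierarchical CP (an inner Ulysses-style all-to-all group plus an outer ring group); this is an assumed illustration with plain `torch.distributed` groups, not TE's or Megatron's actual implementation.

```python
import torch.distributed as dist

def build_hierarchical_cp_groups(cp_size: int, a2a_size: int):
    """Split a cp_size-way context-parallel domain into inner (all-to-all) x outer (ring) groups.

    Assumes dist.init_process_group() has already been called on every rank.
    """
    assert cp_size % a2a_size == 0
    world, rank = dist.get_world_size(), dist.get_rank()
    a2a_group, ring_group = None, None
    for start in range(0, world, cp_size):
        cp_ranks = list(range(start, start + cp_size))
        # Inner groups: contiguous ranks that run Ulysses-style all-to-all over heads.
        for i in range(0, cp_size, a2a_size):
            ranks = cp_ranks[i:i + a2a_size]
            g = dist.new_group(ranks)
            if rank in ranks:
                a2a_group = g
        # Outer groups: strided ranks that exchange KV chunks peer-to-peer in a ring.
        for i in range(a2a_size):
            ranks = cp_ranks[i::a2a_size]
            g = dist.new_group(ranks)
            if rank in ranks:
                ring_group = g
    return a2a_group, ring_group
```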
Hi @taesiri 🤗 I'm Niels and I work as part of the open-source team at Hugging Face. I discovered your work through Hugging Face's daily papers, as yours was featured: https://huggingface.co/papers/2509.25149....
Refactors the test_checkpoint.py test suite to be a bit more pytest-native and removes the need to pre-generate checkpoint files. Also adds some (currently failing) torch.dcp and huggingface checkpoint tests.
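The refactoring pattern might look like the following pytest sketch, where the checkpoint is generated on the fly via the `tmp_path` fixture instead of being shipped as a pre-built file; the fixture and test names are illustrative, not the actual suite.

```python
import pytest
import torch

@pytest.fixture
def checkpoint_path(tmp_path):
    """Generate a small checkpoint at test time instead of pre-generating files."""
    model = torch.nn.Linear(16, 16)
    path = tmp_path / "model.pt"
    torch.save(model.state_dict(), path)
    return path

def test_checkpoint_roundtrip(checkpoint_path):
    model = torch.nn.Linear(16, 16)
    model.load_state_dict(torch.load(checkpoint_path))
```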
Adds some currently failing Hugging Face tests around safetensors and `quantized_model_init`
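A sketch of what a currently-failing safetensors test could look like, marked `xfail` until the gap is fixed; the reason string and tensor contents are placeholders, not the PR's actual tests.

```python
import pytest
import torch
from safetensors.torch import save_file, load_file

@pytest.mark.xfail(reason="placeholder: known gap in the safetensors round-trip")
def test_safetensors_roundtrip(tmp_path):
    state = {"weight": torch.randn(8, 8)}
    path = str(tmp_path / "model.safetensors")
    save_file(state, path)
    loaded = load_file(path)
    torch.testing.assert_close(loaded["weight"], state["weight"])
```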
# Description Fixes a bug that causes precision issues in mixed-precision training. The current implementation of the copy_ method in the QuantizedTensor class does not properly pass the dst.dtype information when src is...
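An illustrative sketch of the failure mode being fixed; this is not TE's QuantizedTensor code, just a minimal example of a copy helper that must cast to the destination dtype rather than inheriting the source's.

```python
import torch

def copy_into(dst: torch.Tensor, src_quantized: torch.Tensor, src_scale: torch.Tensor):
    """Copy a (hypothetical) quantized source into dst while respecting dst.dtype."""
    dequant = src_quantized.to(torch.float32) * src_scale  # dequantize
    # Buggy pattern: copying dequant in the source's working dtype drops dst.dtype.
    # Correct pattern: cast explicitly to dst.dtype before the copy.
    dst.copy_(dequant.to(dst.dtype))
    return dst
```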
# Description In Megatron-Core + Transformer Engine (TE), we quantize activations to FP8 before the MoE up-projection and then run the dispatch. This is compatible with TE’s FP8 fprop for...
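A schematic of the ordering being described (quantize, then dispatch, then the FP8 up-projection); every function name here is a placeholder, not a Megatron-Core or TE API.

```python
# Schematic only; router/dispatch/experts are placeholders, not real APIs.
def moe_forward(hidden_states, router, quantize_fp8, dispatch, experts):
    probs, routing_map = router(hidden_states)
    # Quantize activations to FP8 *before* token dispatch so the all-to-all moves
    # 1-byte payloads and the experts' fprop can consume FP8 directly.
    hidden_fp8 = quantize_fp8(hidden_states)
    dispatched = dispatch(hidden_fp8, routing_map)
    expert_out = experts(dispatched)  # FP8 fprop in the MoE up-projection
    return expert_out, probs
```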
# Description This PR adapts TE for activation offloading (a new feature in Megatron-LM, https://github.com/NVIDIA/Megatron-LM/pull/1752). Activation offloading selects the inputs of specific modules (such as `core_attn`, `qkv_linear`, `router_fc1`), offloading...
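A rough illustration of per-module activation offloading using PyTorch's saved-tensor hooks; this is not the Megatron-LM implementation, just the idea that only selected modules stash their saved activations in pinned host memory.

```python
import torch
from torch.autograd.graph import save_on_cpu

class OffloadWrapper(torch.nn.Module):
    """Offload the wrapped module's saved activations to CPU during forward."""

    def __init__(self, module: torch.nn.Module):
        super().__init__()
        self.module = module

    def forward(self, *args, **kwargs):
        # save_on_cpu moves tensors saved for backward to pinned host memory
        # and copies them back when the backward pass needs them.
        with save_on_cpu(pin_memory=True):
            return self.module(*args, **kwargs)

# e.g. wrap only the attention block, analogous to selecting `core_attn`:
# block.attn = OffloadWrapper(block.attn)
```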
# Description Fix assertion error message formatting in DotProductAttention ## Type of change - [ ] Documentation change (change only to the documentation, either a fix or new content)...
Hello, I am trying to run the latest NVIDIA Cosmos model on an RTX 4090 and I get an error when fused attention is called: line 1080 in fused_attn.py...
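If it helps to isolate the failure, one common workaround (assuming the fused-attention backend is the culprit on this GPU) is to disable it via the `NVTE_FUSED_ATTN` environment variable before TE is imported, so attention falls back to another backend; whether that avoids the error here is an assumption.

```python
import os

# Disable the fused attention backend; must be set before transformer_engine is imported.
os.environ["NVTE_FUSED_ATTN"] = "0"

import transformer_engine.pytorch as te  # noqa: E402
```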