TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
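As a quick orientation, here is a minimal FP8 usage sketch in PyTorch following the pattern from the TransformerEngine documentation; the layer sizes and recipe settings are arbitrary and chosen only for illustration.

```python
# Minimal sketch: running a TE Linear layer under FP8 autocast (PyTorch).
# Layer sizes and the DelayedScaling recipe settings are illustrative only.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling(margin=0, amax_history_len=16)

layer = te.Linear(1024, 1024, bias=True).cuda()
inp = torch.randn(32, 1024, device="cuda")

# fp8_autocast switches supported TE modules to FP8 math on capable GPUs.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(inp)

out.float().sum().backward()
```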
# Description UnfusedDotProductAttention in TE uses -10000 to fill in the attention mask, but this value is not small enough in some cases, which leads to a large diff between TE...
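To illustrate the issue (a standalone sketch, not TE code): when the unmasked logits themselves are strongly negative, a fill value of -10000 no longer dominates them, so masked positions receive non-negligible probability, whereas a -inf (or dtype-minimum) fill does not.

```python
# Standalone illustration (not TE code): why -10000 can be an insufficient mask fill.
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# One row of attention logits where the legitimate scores are already very negative.
logits = np.array([-9999.0, -9998.0, 5.0], dtype=np.float64)
mask = np.array([False, False, True])  # last position should be masked out

probs_small_fill = softmax(np.where(mask, -10000.0, logits))
probs_inf_fill = softmax(np.where(mask, -np.inf, logits))

print(probs_small_fill)  # masked position still gets ~9% of the weight
print(probs_inf_fill)    # masked position gets exactly zero
```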
# Description This PR integrates the TE/common cuBlasMp bindings into the TE/JAX CollectiveGemm custom op.
# Description This is a continuation of the efforts in #2357. [FA3](https://github.com/Dao-AILab/flash-attention/blob/fbf24f67cf7f6442c5cfb2c1057f4bfc57e72d89/hopper/flash_attn_interface.py#L269) lets users set the `num_splits` option to control the number of kernels launched for attention, which could...
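For context, `num_splits` controls FlashAttention's split-KV path, where the KV sequence is processed in independent chunks and the partial outputs are merged using their log-sum-exp statistics. The sketch below shows that merge rule in NumPy; it is an illustration of the idea, not the FA3 implementation.

```python
# Illustration of split-KV attention combination (not the FA3 kernel itself):
# each KV split produces a partial output and a log-sum-exp (LSE); merging the
# splits via the LSEs reproduces the full-softmax result exactly.
import numpy as np

def attn_partial(q, k, v):
    s = q @ k.T                                    # (Lq, Lk) raw scores
    m = s.max(axis=-1, keepdims=True)
    p = np.exp(s - m)
    lse = m.squeeze(-1) + np.log(p.sum(axis=-1))   # per-query log-sum-exp
    o = p @ v / p.sum(axis=-1, keepdims=True)      # partial softmax output
    return o, lse

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(32, 8))
v = rng.normal(size=(32, 8))

# Full attention in one pass.
o_full, _ = attn_partial(q, k, v)

# Same attention computed over two KV splits, then merged via the LSEs.
o1, lse1 = attn_partial(q, k[:16], v[:16])
o2, lse2 = attn_partial(q, k[16:], v[16:])
lse = np.logaddexp(lse1, lse2)
o_merged = np.exp(lse1 - lse)[:, None] * o1 + np.exp(lse2 - lse)[:, None] * o2

assert np.allclose(o_full, o_merged)
```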
# Description This PR is one of several ongoing grouped-kernel PRs for NVFP4 that reduce CPU overhead and quantization cost. **This PR is ready for code review.** Action...
# Description This MR allows one to specify `cp_rank` when calling `get_batch_on_this_cp_rank`, which makes it possible to determine the batch for a specific rank without needing to provide the full batches for all ranks...
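For reference, TE's context-parallel helpers use a load-balanced layout for causal attention in which the sequence is split into 2*cp_size chunks and rank i keeps chunks i and (2*cp_size - 1 - i). The sketch below reproduces that slicing for a single tensor and a single requested rank; it is an illustration, not the actual `get_batch_on_this_cp_rank` code.

```python
# Illustration of load-balanced context-parallel slicing (not TE's actual code):
# the sequence dimension is split into 2 * cp_size chunks and rank r keeps
# chunks r and (2 * cp_size - 1 - r), giving every rank a balanced causal workload.
import torch

def slice_for_cp_rank(tensor, cp_size, cp_rank, seq_dim=1):
    chunks = tensor.chunk(2 * cp_size, dim=seq_dim)
    return torch.cat([chunks[cp_rank], chunks[2 * cp_size - 1 - cp_rank]], dim=seq_dim)

tokens = torch.arange(16).reshape(1, 16)     # batch=1, seq_len=16
print(slice_for_cp_rank(tokens, cp_size=4, cp_rank=1))
# -> chunks 1 and 6: tokens [2, 3] and [12, 13]
```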
# Description Based on single-GPU profiling of the `GroupedLinear` module, this PR implements several optimizations to reduce CPU overhead due to PyTorch.
# Description PR #2148 added support for sink attention to common and PyTorch. This PR adds support for JAX. Fixes #2070. The PR includes a BEFORE/AFTER test runtime summary (grouped by function)...
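As background, here is an illustrative sketch under the common formulation of sink attention, not TE's implementation: a per-head sink logit is appended to the softmax so it can absorb probability mass, and the sink column is dropped before the weighted sum over values.

```python
# Illustrative sink-attention softmax (one head, NumPy), not TE's implementation:
# an extra "sink" logit joins the softmax and absorbs probability mass, and its
# column is discarded before weighting the values.
import numpy as np

def sink_softmax(scores, sink_logit):
    # scores: (Lq, Lk); sink_logit: scalar appended as an extra column
    ext = np.concatenate([scores, np.full((scores.shape[0], 1), sink_logit)], axis=-1)
    ext = ext - ext.max(axis=-1, keepdims=True)
    p = np.exp(ext)
    p = p / p.sum(axis=-1, keepdims=True)
    return p[:, :-1]          # drop the sink column; rows now sum to <= 1

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 6))
weights = sink_softmax(scores, sink_logit=2.0)
print(weights.sum(axis=-1))   # less than 1: the remainder went to the sink
```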
# Description In certain cases the random sign mask and the normalization applied to the RHT matrix are not cached in TE/JAX, leading to a slight perf impact because these...
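For context, here is a standalone sketch of the idea, not TE/JAX code: a randomized Hadamard transform multiplies the input by a fixed random sign vector and a normalized Hadamard matrix before quantization, and both the sign mask and the 1/sqrt(d) normalization are quantities one would want to compute once and cache.

```python
# Standalone sketch of a randomized Hadamard transform (RHT), not TE/JAX code:
# the random sign mask and the normalized Hadamard matrix are fixed per size,
# so computing them once and reusing them avoids per-call overhead.
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

d = 16
rng = np.random.default_rng(0)
signs = rng.choice([-1.0, 1.0], size=d)          # random sign mask (cacheable)
h_normalized = hadamard(d) / np.sqrt(d)          # normalized RHT matrix (cacheable)

x = rng.normal(size=(8, d))
x_rht = (x * signs) @ h_normalized               # rotate before quantization

# The transform is orthogonal, so it can be undone exactly.
assert np.allclose((x_rht @ h_normalized.T) * signs, x)
```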