TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
Reformatted the FP8 meta into one set per tensor, removed `fp8_max` and `scale_inv` from the FP8 meta, and deleted unused functions and types…
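For context, a minimal sketch of why `fp8_max` and `scale_inv` can be dropped from the meta: under delayed scaling the quantization scale is derived from the rolling amax history and the format maximum, and `scale_inv` is just the scale's reciprocal, so both are recoverable on the fly. The names and margin handling below are illustrative, not TE's actual internals.

```python
import torch

E4M3_MAX = 448.0  # format maximum for FP8 E4M3

def compute_scale(amax_history: torch.Tensor, margin: int = 0) -> torch.Tensor:
    # Delayed scaling: scale = fp8_max / amax / 2**margin, derived on demand
    # instead of being stored in the per-tensor meta.
    amax = amax_history.max()
    scale = E4M3_MAX / amax / (2.0 ** margin)
    # Fall back to 1.0 when amax is zero or non-finite.
    return torch.where(torch.isfinite(scale), scale, torch.ones_like(scale))

amax_history = torch.tensor([3.2, 5.1, 4.7])
scale = compute_scale(amax_history)
scale_inv = scale.reciprocal()  # recoverable from `scale`, so not stored either
```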
Meta released the Llama 3 model in April. We have a tutorial for Llama 2, and it turned out to work with Llama 3 as well, so I updated the comments within the tutorial. They…
`theta` in the `inv_freq` computation of `RotaryPositionEmbedding` is hard-coded to 10,000: https://github.com/NVIDIA/TransformerEngine/blob/50e7a3da8f3e04a054c9c7212bd80f71c6814a25/transformer_engine/pytorch/attention.py#L1371-L1377
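For reference, a minimal sketch of the standard RoPE inverse-frequency computation with the base exposed as a parameter; the function name `rope_inv_freq` is hypothetical, but the formula matches the linked code.

```python
import torch

def rope_inv_freq(dim: int, theta: float = 10000.0) -> torch.Tensor:
    # Standard RoPE inverse frequencies; `theta` is the base that the linked
    # code hard-codes to 10000.
    return 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))

# Long-context models often want a larger base, e.g. 500000 for Llama 3.
inv_freq = rope_inv_freq(128, theta=500000.0)
```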
Remove `act_enum` from the `del` list in `ActLuPrimitive*.partition`…
This PR helps resolve issues #614 and #629. Moving forward, we'd like to define attention masks consistently in PyTorch, JAX, and Paddle, with `True` meaning masking out the…
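A small sketch of that convention, assuming plain PyTorch tensors: positions where the boolean mask is `True` are excluded from attention by setting their scores to `-inf` before the softmax.

```python
import torch

def apply_attention_mask(scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Convention: `True` entries in `mask` are masked OUT of attention.
    return scores.masked_fill(mask, float("-inf"))

scores = torch.randn(1, 1, 4, 4)  # [batch, heads, q_len, k_len]
causal = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)
probs = torch.softmax(apply_attention_mask(scores, causal), dim=-1)
```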
This PR moves all the userbuffers code in TE/pytorch to TE/common and refactors the interfaces to make TE/common/userbuffers accessible to all framework integrations. **To do:** - [x] Move userbuffers from...
This PR adds THD support for fused attention (the `F16_arbitrary_seqlen` backend). This feature allows users to run attention for two more cases, e.g. case 1: no padding between sequences…
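To illustrate the THD layout this backend consumes, here is a sketch of packing variable-length sequences without padding; the cumulative-sequence-length (`cu_seqlens`) bookkeeping is the standard convention for such packed inputs, though the exact tensor names TE expects may differ.

```python
import torch
import torch.nn.functional as F

# Three sequences of lengths 3, 5, and 2 packed back-to-back along the token
# dimension ("case 1: no padding between sequences"), 8 heads, head_dim 16.
seqlens = torch.tensor([3, 5, 2], dtype=torch.int32)
cu_seqlens = F.pad(torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0))
# cu_seqlens == tensor([0, 3, 8, 10]); sequence j occupies rows
# cu_seqlens[j]:cu_seqlens[j + 1] of the packed [total_tokens, h, d] tensor.
packed = torch.randn(int(cu_seqlens[-1]), 8, 16)
```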
I added tutorials for finetuning and generation with the Gemma model. Moreover, I added a few features that were necessary to make the tutorials work…
This PR refactors the logic for FP8 weight workspaces in `te.Linear`, `te.LayerNormLinear`, and `te.LayerNormMLP`. The existing logic is somewhat convoluted since it was designed to pass around raw UINT8 buffers...
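For orientation, a sketch of the user-facing API this refactor leaves unchanged: FP8 weight casting happens inside the module, so callers only wrap the forward pass in `fp8_autocast`. The recipe values below are arbitrary examples, and API details may vary by TE version.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# FP8 weight workspaces are managed internally; the module is used like any
# regular torch module once the forward pass runs under fp8_autocast.
fp8_recipe = recipe.DelayedScaling(margin=0, amax_history_len=16)
linear = te.Linear(1024, 1024, bias=True).cuda()
inp = torch.randn(32, 1024, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = linear(inp)
out.sum().backward()
```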
Using FP8 to train a 1B model on an H800 resulted in a significant decrease in throughput compared to FP16. However, examining the PyTorch profiler output shows a significant…
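One way to reproduce that kind of measurement is a short `torch.profiler` run over a single step, sorting kernels by GPU time to see what dominates; the model below is a stand-in, not the 1B configuration from the report.

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(4096, 4096).cuda()
inp = torch.randn(8, 4096, device="cuda")

# Profile one forward/backward step and list the most expensive kernels.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    out = model(inp)
    out.sum().backward()
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```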