TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
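For orientation, here is a minimal sketch of FP8 execution with TE's PyTorch API, following the library's documented quickstart pattern (exact recipe arguments vary by version):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

model = te.Linear(768, 768, bias=True)           # drop-in replacement for torch.nn.Linear
inp = torch.randn(32, 768, device="cuda")

fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID)  # E4M3 forward, E5M2 backward
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(inp)                              # GEMMs run in FP8 inside this context
out.sum().backward()
```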

Results: 414 TransformerEngine issues, sorted by recently updated

# Description Reformatted FP8 meta to one set per tensor, removed `fp8_max` and `scale_inv` from the set of FP8 meta, and deleted unused functions and types. Fixes # (issue) To...

# Description Meta released the Llama 3 model in April. We have a tutorial for Llama 2, and it turned out that it works with Llama 3. I changed the comments within the tutorial. They...

documentation
1.7.0

`theta` in `inv_freq` of `RotaryPositionEmbedding` is hard-coded to 10k https://github.com/NVIDIA/TransformerEngine/blob/50e7a3da8f3e04a054c9c7212bd80f71c6814a25/transformer_engine/pytorch/attention.py#L1371-L1377
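For reference, the standard RoPE inverse-frequency formula with `theta` exposed as a parameter looks roughly like this (a minimal standalone sketch, not TE's actual code):

```python
import torch

def rope_inv_freq(dim: int, theta: float = 10000.0) -> torch.Tensor:
    # Standard RoPE inverse frequencies: theta^(-2i/dim) for i in [0, dim/2).
    # Exposing `theta` as an argument (instead of the hard-coded 10000)
    # would allow long-context variants that change the rotary base.
    return 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
```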

# Description Remove `act_enum` from the del list in `ActLuPrimitive*.partition`. ## Type of change - [ ] Documentation change (change only to the documentation, either a fix or a new...

# Description This PR helps resolve issues #614 and #629. Moving forward, we'd like to define the attention mask consistently in PyTorch, JAX, and Paddle, with `True` meaning masking out the...

1.7.0
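Under the proposed convention (`True` = masked out), applying a boolean mask to attention logits would look roughly like this (a generic scaled-dot-product sketch, not TE's kernels):

```python
import torch

def apply_attention_mask(scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Boolean mask convention proposed in this PR:
    # True  -> position is masked out (excluded from attention)
    # False -> position participates in attention
    return scores.masked_fill(mask, float("-inf"))

scores = torch.randn(1, 4, 4)                                       # [batch, query, key] logits
mask = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)   # causal mask
probs = torch.softmax(apply_attention_mask(scores, mask), dim=-1)
```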

This PR moves all the userbuffers code from TE/pytorch to TE/common and refactors the interfaces to make TE/common/userbuffers accessible to all framework integrations. **To do:** - [x] Move userbuffers from...

# Description This PR adds THD support for fused attention (`F16_arbitrary_seqlen` backend). This feature allows users to run attention for two more cases: ``` case 1: no padding between sequences...
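For context, THD is the packed [total_tokens, heads, head_dim] layout for variable-length sequences, typically described by cumulative sequence lengths. A rough sketch of constructing such a batch (the `cu_seqlens` naming follows the common fused-attention convention; TE's exact API may differ):

```python
import torch

seq_lens = torch.tensor([5, 3, 7], dtype=torch.int32)   # three variable-length sequences
heads, head_dim = 8, 64

# THD layout: tokens from all sequences packed into one [total_tokens, heads, head_dim] tensor.
q = torch.randn(int(seq_lens.sum()), heads, head_dim)

# Cumulative sequence lengths delimit each sequence in the packed tensor:
# here cu_seqlens = [0, 5, 8, 15], and sequence i occupies rows cu_seqlens[i]:cu_seqlens[i+1].
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.int32),
                        torch.cumsum(seq_lens, dim=0).to(torch.int32)])
```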

# Description I added tutorials for fine-tuning and generation with the Gemma model. Moreover, I added a few features that were necessary to make my tutorials work. ## Type...

This PR refactors the logic for FP8 weight workspaces in `te.Linear`, `te.LayerNormLinear`, and `te.LayerNormMLP`. The existing logic is somewhat convoluted since it was designed to pass around raw UINT8 buffers...

bug
enhancement
1.7.0

Using FP8 to train a 1B model on an H800 resulted in a significant decrease in throughput compared to FP16. However, upon examining the PyTorch profiler, there is a significant...

performance
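A minimal sketch of the kind of profiling described above, using `torch.profiler` on a generic training step (the model here is a stand-in, not the 1B model from the report):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024).cuda()
inputs = torch.randn(64, 1024, device="cuda")

# Profile one training step to see where time goes (CPU launch overhead vs CUDA kernels).
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    loss = model(inputs).sum()
    loss.backward()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```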