TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
Reformatted the FP8 meta into one set per tensor, removed `fp8_max` and `scale_inv` from the FP8 meta, and deleted unused functions and types…
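For context, a minimal sketch of why `fp8_max` and `scale_inv` can be dropped from the meta: under delayed scaling the quantization scale is derived from the rolling amax history and the format maximum, and `scale_inv` is just the scale's reciprocal, so both are recoverable on the fly. The names and margin handling below are illustrative, not TE's actual internals.

```python
import torch

E4M3_MAX = 448.0  # format maximum for FP8 E4M3

def compute_scale(amax_history: torch.Tensor, margin: int = 0) -> torch.Tensor:
    # Delayed scaling: scale = fp8_max / amax / 2**margin, derived on demand
    # instead of being stored in the per-tensor meta.
    amax = amax_history.max()
    scale = E4M3_MAX / amax / (2.0 ** margin)
    # Fall back to 1.0 when amax is zero or non-finite.
    return torch.where(torch.isfinite(scale), scale, torch.ones_like(scale))

amax_history = torch.tensor([3.2, 5.1, 4.7])
scale = compute_scale(amax_history)
scale_inv = scale.reciprocal()  # recoverable from `scale`, so not stored either
```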
Meta released the Llama 3 model in April. We have a tutorial for Llama 2, and it turned out to work with Llama 3 as well, so I updated the comments within the tutorial. They…
`theta` in the `inv_freq` computation of `RotaryPositionEmbedding` is hard-coded to 10,000: https://github.com/NVIDIA/TransformerEngine/blob/50e7a3da8f3e04a054c9c7212bd80f71c6814a25/transformer_engine/pytorch/attention.py#L1371-L1377
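For reference, a minimal sketch of the standard RoPE inverse-frequency computation with the base exposed as a parameter; the function name `rope_inv_freq` is hypothetical, but the formula matches the linked code.

```python
import torch

def rope_inv_freq(dim: int, theta: float = 10000.0) -> torch.Tensor:
    # Standard RoPE inverse frequencies; `theta` is the base that the linked
    # code hard-codes to 10000.
    return 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))

# Long-context models often want a larger base, e.g. 500000 for Llama 3.
inv_freq = rope_inv_freq(128, theta=500000.0)
```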
Remove `act_enum` from the `del` list in `ActLuPrimitive*.partition`…
This PR helps resolve issues #614 and #629. Moving forward, we'd like to define attention masks consistently in PyTorch, JAX, and Paddle, with `True` meaning masking out the…
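A small sketch of that convention, assuming plain PyTorch tensors: positions where the boolean mask is `True` are excluded from attention by setting their scores to `-inf` before the softmax.

```python
import torch

def apply_attention_mask(scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Convention: `True` entries in `mask` are masked OUT of attention.
    return scores.masked_fill(mask, float("-inf"))

scores = torch.randn(1, 1, 4, 4)  # [batch, heads, q_len, k_len]
causal = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)
probs = torch.softmax(apply_attention_mask(scores, causal), dim=-1)
```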
This PR moves all the userbuffers code in TE/pytorch to TE/common and refactors the interfaces to make TE/common/userbuffers accessible to all framework integrations. **To do:** - [x] Move userbuffers from...
This PR adds THD support for fused attention (the `F16_arbitrary_seqlen` backend). This feature allows users to run attention for two more cases, e.g. case 1: no padding between sequences…
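To illustrate the THD layout this backend consumes, here is a sketch of packing variable-length sequences without padding; the cumulative-sequence-length (`cu_seqlens`) bookkeeping is the standard convention for such packed inputs, though the exact tensor names TE expects may differ.

```python
import torch
import torch.nn.functional as F

# Three sequences of lengths 3, 5, and 2 packed back-to-back along the token
# dimension ("case 1: no padding between sequences"), 8 heads, head_dim 16.
seqlens = torch.tensor([3, 5, 2], dtype=torch.int32)
cu_seqlens = F.pad(torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0))
# cu_seqlens == tensor([0, 3, 8, 10]); sequence j occupies rows
# cu_seqlens[j]:cu_seqlens[j + 1] of the packed [total_tokens, h, d] tensor.
packed = torch.randn(int(cu_seqlens[-1]), 8, 16)
```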
I added tutorials for finetuning and generation with the Gemma model. Moreover, I added a few features that were necessary to make the tutorials work…
This PR refactors the logic for FP8 weight workspaces in `te.Linear`, `te.LayerNormLinear`, and `te.LayerNormMLP`. The existing logic is somewhat convoluted since it was designed to pass around raw UINT8 buffers...
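For orientation, a sketch of the user-facing API this refactor leaves unchanged: FP8 weight casting happens inside the module, so callers only wrap the forward pass in `fp8_autocast`. The recipe values below are arbitrary examples, and API details may vary by TE version.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# FP8 weight workspaces are managed internally; the module is used like any
# regular torch module once the forward pass runs under fp8_autocast.
fp8_recipe = recipe.DelayedScaling(margin=0, amax_history_len=16)
linear = te.Linear(1024, 1024, bias=True).cuda()
inp = torch.randn(32, 1024, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = linear(inp)
out.sum().backward()
```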
Using FP8 to train a 1B model on an H800 resulted in a significant decrease in throughput compared to FP16. However, examining the PyTorch profiler output shows a significant…
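One way to reproduce that kind of measurement is a short `torch.profiler` run over a single step, sorting kernels by GPU time to see what dominates; the model below is a stand-in, not the 1B configuration from the report.

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(4096, 4096).cuda()
inp = torch.randn(8, 4096, device="cuda")

# Profile one forward/backward step and list the most expensive kernels.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    out = model(inp)
    out.sum().backward()
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```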