TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization i...
**Describe the bug** When running examples/llama/train_llama3_8b_fp8.sh and building the transformer layer of GPTModel: RuntimeError: /TransformerEngine/transformer_engine/common/util/cuda_driver.cpp:42 in function get_symbol: Assertion failed: driver_result == cudaDriverEntryPointSuccess. Could not find CUDA driver entry point for...
# Description Currently we have the following: 1. Some small e2e model training tests that run a small model for a few epochs as a sanity integration test. 2....
# Description Added UBnext fast allreduce kernels to the linear layer. They fall under symmetric_ar_type, with the new types being 'ubnext' and 'ubnext_add_rms'. ## Details Added NVLS simple and low-latency (Lamport) allreduce kernels...
# Description Follow-up of PR #2219 to apply warmup initialization to the rest of the TE/JAX primitives. The motivation is identical to the previous PR: > If CUDA modules are loaded during any...
# Description Pipeline-aware CPU offload ## Type of change - [x] New feature (non-breaking change which adds functionality) ## Changes Please list the changes introduced in this PR: *...
# Description This is a more memory efficient version for using symmetric memory all reduces. We use a pool of symmetric memory that we grow if we need it to...
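The grow-on-demand idea behind the PR above can be illustrated with a plain-Python sketch. This is a hypothetical stand-in, not TE's actual symmetric-memory API: the class name, the geometric growth policy, and the use of `bytearray` in place of real symmetric GPU allocations are all assumptions for illustration.

```python
class GrowingBufferPool:
    """Hypothetical sketch of a grow-on-demand buffer pool: reuse one
    allocation and enlarge it only when a request exceeds capacity.
    (Names and policy are assumptions, not TE's real implementation.)"""

    def __init__(self, initial_bytes: int = 1 << 20):
        self._buf = bytearray(initial_bytes)

    @property
    def capacity(self) -> int:
        return len(self._buf)

    def request(self, nbytes: int) -> memoryview:
        # Grow geometrically so a run of slightly-larger requests does
        # not trigger a fresh (expensive) allocation every time.
        if nbytes > len(self._buf):
            new_size = len(self._buf)
            while new_size < nbytes:
                new_size *= 2
            self._buf = bytearray(new_size)
        return memoryview(self._buf)[:nbytes]


pool = GrowingBufferPool(1024)
view = pool.request(3000)  # pool doubles 1024 -> 2048 -> 4096
```

Doubling amortizes allocation cost, which matters when the underlying allocation (here, symmetric memory registered across ranks) is expensive to create.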
This adds a short delay kernel to the split_overlap_rs function, which ensures that the GEMMs are properly ordered when run with CUDA graphs. # Description In some situations, such as...
# Issue When testing the linear API provided by NVIDIA's Transformer Engine (with FP8 precision) on an L20 device, I found that it is significantly slower than PyTorch's built-in...
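One common reason an FP8 linear layer can lose to a plain BF16/FP16 GEMM at small shapes is that the quantization work (casting inputs to FP8, tracking amax) scales with the tensor surfaces, roughly O(M·K + K·N), while the GEMM math scales O(M·N·K), so small problems are overhead-dominated. A back-of-envelope sketch (the cost model is an assumption for illustration; it ignores kernel fusion, amax history, and launch latency):

```python
def fp8_overhead_ratio(m: int, n: int, k: int) -> float:
    """Rough ratio of quantization element traffic to GEMM FLOPs.

    Assumes A (m x k) and B (k x n) must each be cast to FP8
    (one read + one write per element) before a 2*m*n*k-FLOP GEMM.
    Purely illustrative, not a measurement of Transformer Engine.
    """
    quant_elems = 2 * (m * k + k * n)   # read + write per input element
    gemm_flops = 2 * m * n * k          # multiply-add per output element
    return quant_elems / gemm_flops


small = fp8_overhead_ratio(128, 128, 128)      # 0.015625
large = fp8_overhead_ratio(8192, 8192, 8192)   # ~0.000244
```

The relative cast overhead shrinks by roughly a factor of 64 going from 128³ to 8192³, which is consistent with FP8 paying off mainly at large GEMM sizes.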
Hi all, if I wanted to do inference using the E5M2 format instead of E4M3, what recipe do I have to use? Or can I even use E5M2 for inference,...
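For context on the question above: the two FP8 formats trade precision for range. E4M3 has more mantissa bits (better precision, max normal 448), while E5M2 has more exponent bits (wider dynamic range, max normal 57344), which is why E5M2 is typically reserved for gradients. A minimal sketch deriving those limits from the bit layouts (pure arithmetic, no TE dependency; the helper name is made up):

```python
def max_normal(exp_bits: int, man_bits: int, bias: int, ieee_like: bool) -> float:
    """Largest finite value of a tiny float format.

    ieee_like=True:  top exponent code reserved for inf/NaN (E5M2,
                     like IEEE 754 formats).
    ieee_like=False: only the all-ones bit pattern is NaN (E4M3), so
                     the top exponent is usable, but the largest
                     mantissa pattern (all ones) is excluded.
    """
    if ieee_like:
        max_exp = (2**exp_bits - 2) - bias
        max_mant = 2 - 2**-man_bits
    else:
        max_exp = (2**exp_bits - 1) - bias
        max_mant = 2 - 2**-(man_bits - 1)
    return max_mant * 2**max_exp


e4m3 = max_normal(exp_bits=4, man_bits=3, bias=7, ieee_like=False)    # 448.0
e5m2 = max_normal(exp_bits=5, man_bits=2, bias=15, ieee_like=True)    # 57344.0
```

If I understand TE's recipe API correctly, `transformer_engine.common.recipe.Format` exposes `E4M3`, `E5M2`, and `HYBRID` (E4M3 forward, E5M2 backward), so a pure-E5M2 run would mean passing `Format.E5M2` to the recipe, though whether that is sensible for inference is exactly the question being asked.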
**Describe the bug** transformer-engine is currently searching for system CUDA binaries ( https://github.com/NVIDIA/TransformerEngine/blob/67fcc15255248a26be124de3854a47f84102f285/transformer_engine/common/__init__.py#L237). This conflicts with PyTorch, which uses the CUDA Python packages (https://pypi.org/project/nvidia-cudnn-cu12/). **Steps/Code to reproduce bug**...
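A sketch of the pip-first lookup the report is asking for: check whether the `nvidia-cudnn-cu12` wheel is importable before falling back to system paths. This is an illustrative assumption about how such a lookup could work, not TE's actual code; the function name is made up.

```python
import importlib.util
from pathlib import Path


def find_pip_cudnn_dir() -> "Path | None":
    """Locate the cuDNN directory shipped by the nvidia-cudnn-cu12
    wheel, if installed; otherwise return None so the caller can
    fall back to system CUDA paths. (Hypothetical helper.)"""
    try:
        spec = importlib.util.find_spec("nvidia.cudnn")
    except ModuleNotFoundError:
        return None  # the 'nvidia' namespace package is not installed
    if spec is None or not spec.submodule_search_locations:
        return None
    # The wheel lays out lib/ and include/ under this package directory.
    return Path(next(iter(spec.submodule_search_locations)))


cudnn_dir = find_pip_cudnn_dir()
```

Returning `None` instead of raising keeps the system-binary search available as a fallback, which is the behavior the issue suggests.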