TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization i...

Results 414 TransformerEngine issues

**Describe the bug** When running examples/llama/train_llama3_8b_fp8.sh and building the transformer layers of GPTModel: RuntimeError: /TransformerEngine/transformer_engine/common/util/cuda_driver.cpp:42 in function get_symbol: Assertion failed: driver_result == cudaDriverEntryPointSuccess. Could not find CUDA driver entry point for...

bug

# Description Currently we have the following: 1. We have some small e2e model training tests for a small model on a few epochs as a sanity integration test. 2....

# Description Added UBnext fast allreduce kernels into the linear layer. Falls under symmetric_ar_type, with the new types being 'ubnext' and 'ubnext_add_rms'. # Details Added NVLS simple and low-latency (Lamport) allreduce kernels...

# Description Follow-up of PR #2219 to apply warmup initialization to the rest of the TE/JAX primitives. Motivation, identical to previous PR: > If CUDA modules are loaded during any...

# Description Pipeline-aware CPU offload ## Type of change - [x] New feature (non-breaking change which adds functionality) ## Changes Please list the changes introduced in this PR: *...

community-contribution

# Description This is a more memory efficient version for using symmetric memory all reduces. We use a pool of symmetric memory that we grow if we need it to...

community-contribution
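The grow-on-demand pool described in this PR can be illustrated generically. The sketch below is not Transformer Engine's implementation; `GrowablePool` and its geometric-growth policy are hypothetical stand-ins for any backing allocator (such as a symmetric-memory allocator) supplied as `alloc`:

```python
class GrowablePool:
    """Hypothetical sketch of a grow-on-demand buffer pool.

    A single buffer is reused across requests; when a request exceeds
    the current capacity, the pool reallocates with geometric growth
    so that total reallocation cost stays amortized.
    """

    def __init__(self, alloc, initial=1 << 20):
        self.alloc = alloc          # e.g. bytearray, or a device allocator
        self.size = initial
        self.buf = alloc(initial)

    def request(self, nbytes):
        if nbytes > self.size:
            # Grow geometrically, but at least to the requested size.
            new_size = max(nbytes, 2 * self.size)
            self.buf = self.alloc(new_size)
            self.size = new_size
        return self.buf


# Usage with a plain host-memory allocator:
pool = GrowablePool(bytearray, initial=16)
pool.request(8)      # fits; no growth
pool.request(100)    # triggers reallocation to >= 100 bytes
```

The trade-off mirrored here is the one the PR mentions: a pool avoids allocating fresh symmetric memory for every collective, at the cost of occasionally holding more memory than the current operation needs.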

This adds a short delay kernel to the split_overlap_rs function, which ensures that the GEMMs are properly ordered when run with CUDA graphs. # Description In some situations, such as...

community-contribution

# Issue When testing the Linear API provided by NVIDIA's Transformer Engine (with FP8 precision) on an L20 device, I found that it is significantly slower than PyTorch's built-in...

Hi all, if I wanted to do inference using the E5M2 format instead of E4M3, what recipe do I have to use? Or can I even use E5M2 for inference,...

question
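The trade-off behind this question is fixed by the formats themselves: E4M3 spends bits on mantissa (more precision, max finite value 448), while E5M2 spends them on exponent (more dynamic range, max finite value 57344), which is why E5M2 is typically reserved for gradients in training. The helper below (`max_normal` is an illustrative name, not a library function) computes these limits from the bit layout:

```python
def max_normal(exp_bits: int, man_bits: int, ieee_special: bool) -> float:
    """Largest finite value of a floating-point format.

    ieee_special=True: the top exponent code is reserved for inf/NaN
    (IEEE-style, as in E5M2). False: only the all-ones bit pattern is
    NaN, so the top exponent code still carries finite values
    (as in the E4M3 "FN" variant).
    """
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_special:
        max_exp = (2 ** exp_bits - 2) - bias       # top code is inf/NaN
        max_mant = 2.0 - 2.0 ** (-man_bits)        # all mantissa bits set
    else:
        max_exp = (2 ** exp_bits - 1) - bias       # top code still finite
        max_mant = 2.0 - 2.0 ** (1 - man_bits)     # all-ones pattern is NaN
    return max_mant * 2.0 ** max_exp


print(max_normal(4, 3, ieee_special=False))  # E4M3: 448.0
print(max_normal(5, 2, ieee_special=True))   # E5M2: 57344.0
```

Because inference activations and weights rarely need the extra range but do benefit from the extra mantissa bit, E4M3 is the usual choice for forward/inference passes.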

**Describe the bug** transformer-engine currently searches for system CUDA binaries (https://github.com/NVIDIA/TransformerEngine/blob/67fcc15255248a26be124de3854a47f84102f285/transformer_engine/common/__init__.py#L237). This conflicts with PyTorch, which uses the pip-installed CUDA Python packages (https://pypi.org/project/nvidia-cudnn-cu12/). **Steps/Code to reproduce bug**...

bug
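One direction the report suggests is preferring the pip-installed CUDA components (which wheels ship under `site-packages/nvidia/<pkg>/lib`) over system binaries. The sketch below is a hypothetical discovery helper, not Transformer Engine's actual lookup logic; `find_pip_cuda_roots` is an invented name:

```python
import pathlib
import site


def find_pip_cuda_roots():
    """Return candidate package roots for pip-installed CUDA components.

    Wheels such as nvidia-cudnn-cu12 install shared libraries under
    site-packages/nvidia/<package>/lib; this scans the known
    site-packages directories for such layouts.
    """
    roots = []
    search_dirs = list(site.getsitepackages())
    search_dirs.append(site.getusersitepackages())
    for sp in search_dirs:
        nv = pathlib.Path(sp) / "nvidia"
        if nv.is_dir():
            roots.extend(p for p in nv.iterdir() if (p / "lib").is_dir())
    return roots


# Returns an empty list when no nvidia-* wheels are installed,
# so a build script could fall back to a system CUDA search.
print(find_pip_cuda_roots())
```

A resolution along these lines would keep the library consistent with PyTorch's packaging, which the report notes already relies on the pip CUDA wheels.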