Using system CUDA libraries
Describe the bug
transformer-engine currently searches for system CUDA binaries ( https://github.com/NVIDIA/TransformerEngine/blob/67fcc15255248a26be124de3854a47f84102f285/transformer_engine/common/init.py#L237). This conflicts with PyTorch, which uses the CUDA Python packages (https://pypi.org/project/nvidia-cudnn-cu12/).
Steps/Code to reproduce bug
Tried using transformer-engine in a Docker container that did not have a system-level CUDA installation.
Expected behavior
transformer-engine should find the CUDA libraries inside the CUDA Python packages. Example:
import os
import nvidia.cudnn

# nvidia.cudnn.__file__ points at the package's __init__.py, so take
# its directory before appending "lib".
lib_path = os.path.join(os.path.dirname(nvidia.cudnn.__file__), "lib")
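A minimal sketch of that lookup, generalized to any CUDA Python package (the function name `find_package_lib_dir` is hypothetical, not part of transformer-engine; it assumes the pip wheels' layout of shared libraries in a `lib/` directory next to the package's `__init__.py`):

```python
import importlib.util
import os

def find_package_lib_dir(module_name):
    """Return the bundled lib/ directory of a pip-installed CUDA package
    (e.g. "nvidia.cudnn" from nvidia-cudnn-cu12), or None if absent."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:
        return None  # parent package (e.g. "nvidia") is not installed
    if spec is None or spec.origin is None:
        return None  # module not found, or it is a bare namespace package
    # spec.origin is the path of the package's __init__.py; the shared
    # libraries ship in a lib/ directory alongside it.
    lib_dir = os.path.join(os.path.dirname(spec.origin), "lib")
    return lib_dir if os.path.isdir(lib_dir) else None
```

With nvidia-cudnn-cu12 installed, `find_package_lib_dir("nvidia.cudnn")` returns the directory containing the bundled cuDNN shared libraries; otherwise it returns None.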
- In most cases, the import infrastructure searches for system installs first: https://github.com/NVIDIA/TransformerEngine/blob/af2a0c16ec11363c0af84690cd877a59f898820e/transformer_engine/common/init.py#L234-L247
- If that fails, it searches for a Python package: https://github.com/NVIDIA/TransformerEngine/blob/af2a0c16ec11363c0af84690cd877a59f898820e/transformer_engine/common/init.py#L249-L252
- As a last resort, it does nothing and hopes that the linker can find the shared library.
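That fallback order could be sketched roughly like this (`env_var` stands in for a hypothetical configuration knob, not transformer-engine's actual variable names, and the helper itself is illustrative, not the real implementation):

```python
import importlib.util
import os

def resolve_shared_lib(env_var, package, soname):
    """Sketch of the described search order for a shared library."""
    # 1. System install first, configurable via an environment variable
    #    pointing at the install root (hypothetical variable name).
    root = os.environ.get(env_var)
    if root:
        candidate = os.path.join(root, "lib", soname)
        if os.path.exists(candidate):
            return candidate
    # 2. Then the CUDA Python package bundled under site-packages
    #    (e.g. nvidia.cudnn from nvidia-cudnn-cu12).
    try:
        spec = importlib.util.find_spec(package)
    except ModuleNotFoundError:
        spec = None
    if spec is not None and spec.origin is not None:
        candidate = os.path.join(os.path.dirname(spec.origin), "lib", soname)
        if os.path.exists(candidate):
            return candidate
    # 3. Last resort: return the bare soname and hope the dynamic
    #    linker can resolve it at load time.
    return soname
```

For example, `resolve_shared_lib("CUDNN_HOME", "nvidia.cudnn", "libcudnn.so")` would prefer a system install under `$CUDNN_HOME`, then the pip package, then fall back to the bare soname.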
We prefer prioritizing the system install over the Python package because it is more configurable. If you already have an install of CUDA/cuDNN/etc, or perhaps multiple installs, then you can specify the desired library by setting environment variables.
I think it makes more sense to search the other way round: you load a conda environment with the expectation that it will supersede the system-level installs.