Using system CUDA libraries

Open spectralflight opened this issue 3 months ago • 1 comments

Describe the bug

transformer-engine currently searches for system CUDA binaries (https://github.com/NVIDIA/TransformerEngine/blob/67fcc15255248a26be124de3854a47f84102f285/transformer_engine/common/init.py#L237). This conflicts with PyTorch, which uses the CUDA libraries distributed as Python packages (e.g. https://pypi.org/project/nvidia-cudnn-cu12/).

Steps/Code to reproduce bug

Tried using transformer-engine in a Docker container that did not have a system CUDA installed.

Expected behavior

transformer-engine should find the CUDA libraries inside the CUDA Python packages. Example:

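import os
import nvidia.cudnn

# __file__ points at the package's __init__.py, so take its directory first;
# the shared libraries sit in the lib/ subdirectory next to it.
lib_path = os.path.join(os.path.dirname(nvidia.cudnn.__file__), "lib")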

spectralflight avatar Sep 02 '25 20:09 spectralflight

  1. In most cases, the import infrastructure searches for system installs first: https://github.com/NVIDIA/TransformerEngine/blob/af2a0c16ec11363c0af84690cd877a59f898820e/transformer_engine/common/init.py#L234-L247

  2. If that fails, it searches for a Python package: https://github.com/NVIDIA/TransformerEngine/blob/af2a0c16ec11363c0af84690cd877a59f898820e/transformer_engine/common/init.py#L249-L252

  3. As a last resort, it does nothing and hopes that the linker can find the shared lib.

We prefer prioritizing the system install over the Python package because it is more configurable. If you already have an install of CUDA/cuDNN/etc, or perhaps multiple installs, then you can specify the desired library by setting environment variables.
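
For illustration, the three-step lookup described above could be sketched roughly as follows. This is not TransformerEngine's actual code: the helper names (`find_in_dirs`, `find_in_package`, `locate`), the `prefer_package` flag, and the default environment variable names are assumptions made for this example only.

```python
import glob
import importlib.util
import os
from typing import Optional, Sequence


def find_in_dirs(lib_glob: str, dirs: Sequence[str]) -> Optional[str]:
    """Return the first file matching lib_glob anywhere under dirs, else None."""
    for d in dirs:
        if not d:  # skip unset environment variables
            continue
        pattern = os.path.join(d, "**", lib_glob)
        matches = sorted(glob.glob(pattern, recursive=True))
        if matches:
            return matches[0]
    return None


def find_in_package(lib_glob: str, package: str) -> Optional[str]:
    """Return a matching file under <package dir>/lib, else None."""
    try:
        spec = importlib.util.find_spec(package)
    except (ImportError, ValueError):
        return None  # parent package missing or otherwise not importable
    if spec is None or not spec.submodule_search_locations:
        return None
    lib_dirs = [os.path.join(p, "lib") for p in spec.submodule_search_locations]
    return find_in_dirs(lib_glob, lib_dirs)


def locate(
    lib_glob: str,
    package: str,
    env_vars: Sequence[str] = ("CUDNN_HOME", "CUDA_HOME"),  # assumed names
    prefer_package: bool = False,
) -> Optional[str]:
    """Try the system install, then the Python package (or the reverse)."""
    system_dirs = [os.environ.get(v, "") for v in env_vars]
    steps = [
        lambda: find_in_dirs(lib_glob, system_dirs),
        lambda: find_in_package(lib_glob, package),
    ]
    if prefer_package:
        steps.reverse()
    for step in steps:
        hit = step()
        if hit:
            return hit
    # Last resort: return nothing and let the dynamic linker resolve the library.
    return None
```

With this sketch, `locate("libcudnn.so*", "nvidia.cudnn")` would check the directories named by the environment variables first and fall back to the pip-installed package, while `prefer_package=True` reverses that order, which is the behavior requested in this issue.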

timmoon10 avatar Oct 08 '25 23:10 timmoon10

I think it makes more sense to search the other way round, since you load a conda environment with the expectation that it will supersede the system-level installs.

hscarter avatar Dec 08 '25 16:12 hscarter