Using system CUDA libraries
Describe the bug
transformer-engine currently searches for system CUDA binaries ( https://github.com/NVIDIA/TransformerEngine/blob/67fcc15255248a26be124de3854a47f84102f285/transformer_engine/common/init.py#L237). This conflicts with PyTorch, which uses the CUDA Python packages (https://pypi.org/project/nvidia-cudnn-cu12/).
Steps/Code to reproduce bug
Tried using transformer-engine in a Docker container that did not have a system-level CUDA installation.
Expected behavior
transformer-engine should find the CUDA libraries inside the CUDA Python packages. Example:
import os
import nvidia.cudnn

# nvidia.cudnn.__file__ points at the package's __init__.py, so take
# its directory before appending "lib".
lib_path = os.path.join(os.path.dirname(nvidia.cudnn.__file__), "lib")
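A minimal sketch of that lookup, generalized to any CUDA Python package (the function name `find_package_lib_dir` is hypothetical, not part of transformer-engine; it assumes the pip wheels' layout of shared libraries in a `lib/` directory next to the package's `__init__.py`):

```python
import importlib.util
import os

def find_package_lib_dir(module_name):
    """Return the bundled lib/ directory of a pip-installed CUDA package
    (e.g. "nvidia.cudnn" from nvidia-cudnn-cu12), or None if absent."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:
        return None  # parent package (e.g. "nvidia") is not installed
    if spec is None or spec.origin is None:
        return None  # module not found, or it is a bare namespace package
    # spec.origin is the path of the package's __init__.py; the shared
    # libraries ship in a lib/ directory alongside it.
    lib_dir = os.path.join(os.path.dirname(spec.origin), "lib")
    return lib_dir if os.path.isdir(lib_dir) else None
```

With nvidia-cudnn-cu12 installed, `find_package_lib_dir("nvidia.cudnn")` returns the directory containing the bundled cuDNN shared libraries; otherwise it returns None.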
- In most cases, the import infrastructure searches for system installs first: https://github.com/NVIDIA/TransformerEngine/blob/af2a0c16ec11363c0af84690cd877a59f898820e/transformer_engine/common/init.py#L234-L247
- If that fails, it searches for a Python package: https://github.com/NVIDIA/TransformerEngine/blob/af2a0c16ec11363c0af84690cd877a59f898820e/transformer_engine/common/init.py#L249-L252
- As a last resort, it does nothing and hopes that the linker can find the shared library.
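That fallback order could be sketched roughly like this (`env_var` stands in for a hypothetical configuration knob, not transformer-engine's actual variable names, and the helper itself is illustrative, not the real implementation):

```python
import importlib.util
import os

def resolve_shared_lib(env_var, package, soname):
    """Sketch of the described search order for a shared library."""
    # 1. System install first, configurable via an environment variable
    #    pointing at the install root (hypothetical variable name).
    root = os.environ.get(env_var)
    if root:
        candidate = os.path.join(root, "lib", soname)
        if os.path.exists(candidate):
            return candidate
    # 2. Then the CUDA Python package bundled under site-packages
    #    (e.g. nvidia.cudnn from nvidia-cudnn-cu12).
    try:
        spec = importlib.util.find_spec(package)
    except ModuleNotFoundError:
        spec = None
    if spec is not None and spec.origin is not None:
        candidate = os.path.join(os.path.dirname(spec.origin), "lib", soname)
        if os.path.exists(candidate):
            return candidate
    # 3. Last resort: return the bare soname and hope the dynamic
    #    linker can resolve it at load time.
    return soname
```

For example, `resolve_shared_lib("CUDNN_HOME", "nvidia.cudnn", "libcudnn.so")` would prefer a system install under `$CUDNN_HOME`, then the pip package, then fall back to the bare soname.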
We prefer prioritizing the system install over the Python package because it is more configurable. If you already have an install of CUDA/cuDNN/etc, or perhaps multiple installs, then you can specify the desired library by setting environment variables.
I think it makes more sense to search the other way round: you load a conda environment with the expectation that it will supersede the system-level installs.