ImportError `undefined symbol` of `fused_layer_norm_cuda`
## Describe the Bug

I've followed the installation instructions in the README, but an `ImportError` occurs when I import `fused_layer_norm_cuda`. I suspect the problem is a version conflict between CUDA, PyTorch, and GCC; however, I can't find any documented version requirements. 😵💫

## Minimal Steps/Code to Reproduce the Bug
```shell
gcc --version
python -c "import torch; print(torch.__version__); print(torch.version.cuda); import fused_layer_norm_cuda"
```
Output:

```
gcc (GCC) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

1.11.0+cu113
11.3
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: /mnt/lustre/sjtu/home/zcz72/anaconda3/envs/OFA3.9New/lib/python3.9/site-packages/fused_layer_norm_cuda.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZNSt18basic_stringstreamIcSt11char_traitsIcESaIcEEC1Ev
```
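For reference, the missing symbol demangles (via `c++filt`) to the default constructor of `std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >`, which libstdc++ only began exporting with GCC 9 (symbol version `GLIBCXX_3.4.26`). A minimal diagnostic sketch (my own, not from the README) to check whether the `libstdc++.so.6` the process actually loads exports it:

```python
import ctypes

# The symbol from the ImportError above; present only in libstdc++ from GCC 9+.
MISSING_SYMBOL = "_ZNSt18basic_stringstreamIcSt11char_traitsIcESaIcEEC1Ev"

def has_symbol(libname: str, symbol: str) -> bool:
    """Return True if the shared library `libname` exports `symbol`."""
    try:
        lib = ctypes.CDLL(libname)
    except OSError:
        # Library itself could not be loaded.
        return False
    # ctypes raises AttributeError when looking up an undefined symbol,
    # so hasattr() doubles as a symbol-presence check.
    return hasattr(lib, symbol)

if __name__ == "__main__":
    print("libstdc++ exports the symbol:", has_symbol("libstdc++.so.6", MISSING_SYMBOL))
```

If this prints `False` inside the conda environment but `True` against the system libstdc++, the environment is loading a libstdc++ that is too old for an extension compiled with GCC 9.3.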
## Expected Behavior

`import fused_layer_norm_cuda` should succeed without raising an `ImportError`.

## Environment
```
PyTorch version: 1.11.0+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A
OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17
Python version: 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21) [GCC 10.3.0] (64-bit runtime)
Python platform: Linux-3.10.0-693.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.3.58
GPU models and configuration: GPU 0: Tesla V100-PCIE-32GB
Nvidia driver version: 460.73.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.23.1
[pip3] pytorch-lightning==1.0.8
[pip3] torch==1.11.0+cu113
[pip3] torchmetrics==0.9.3
[pip3] torchvision==0.12.0+cu113
[conda] numpy              1.23.1  py39hba7629e_0  https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
[conda] pytorch-lightning  1.0.8   pypi_0          pypi
[conda] torch              1.11.0+cu113  pypi_0    pypi
[conda] torchmetrics       0.9.3   pypi_0          pypi
[conda] torchvision        0.12.0+cu113  pypi_0    pypi
```
---

I'm having this issue too.

---

Has anyone found a solution/workaround to this bug?
---

@XiaohanZhangCMU I got the bug when I was using a Singularity (now Apptainer) container with CUDA 11.7 and PyTorch compiled for CUDA 11.7, while my university's cluster has CUDA 11.1. I was able to fix the issue by building a container with CUDA 11.1 instead. That's obviously not a great solution.
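The mismatch in that comment boils down to a simple rule: an extension (or container) built against a newer CUDA than the host provides will fail at load time. A toy sketch of that check — `versions_compatible` is a hypothetical helper for illustration, not a PyTorch or Apex API:

```python
def cuda_major_minor(version: str) -> tuple:
    """Parse a CUDA version string like '11.7' into a comparable (major, minor) pair."""
    major, minor, *_ = version.split(".")
    return (int(major), int(minor))

def versions_compatible(build_cuda: str, host_cuda: str) -> bool:
    """Hypothetical check: the build must not target a newer CUDA than the host has."""
    return cuda_major_minor(build_cuda) <= cuda_major_minor(host_cuda)

print(versions_compatible("11.7", "11.1"))  # the broken container/host pairing -> False
print(versions_compatible("11.1", "11.1"))  # the rebuilt container -> True
```

In practice the build-side version comes from `torch.version.cuda` (shown in the repro above) and the host-side version from the cluster's CUDA toolkit.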