FBGEMM Failure in import: undefined symbol error from Python 3.7 + CUDA113

I'm using Python 3.7 and CUDA 11.3 (cu113). I tried both fbgemm-gpu and fbgemm-gpu-nightly from pip; both fail on import:

[root@/ml-code/data/michelangelo/examples/torchrec_example/test #]pip3 show fbgemm-gpu
Name: fbgemm-gpu
Version: 0.1.2
Summary: UNKNOWN
Home-page: https://github.com/pytorch/fbgemm
Author: FBGEMM Team
Author-email: [email protected]
License: BSD-3
Location: /usr/lib/python3.7/site-packages
Requires:
Required-by:

>>> import fbgemm_gpu
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.7/site-packages/fbgemm_gpu/__init__.py", line 12, in <module>
    torch.ops.load_library(os.path.join(os.path.dirname(__file__), "fbgemm_gpu_py.so"))
  File "/usr/lib/python3.7/site-packages/torch/_ops.py", line 220, in load_library
    ctypes.CDLL(path)
  File "/usr/lib/python3.7/ctypes/__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /usr/lib/python3.7/site-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZN2at6detail20computeStorageNbytesEN3c108ArrayRefIlEES3_mm
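
The undefined symbol is an Itanium-mangled C++ name. Its readable prefix shows it is `computeStorageNbytes` in the `at::detail` namespace, and the trailing `mm` encodes two `unsigned long` parameters. A minimal sketch for demangling it follows. It assumes the binutils tool `c++filt` is on the PATH; the helper name `demangle` is hypothetical and not part of FBGEMM or PyTorch.

```python
import shutil
import subprocess
from typing import Optional

# Symbol copied verbatim from the OSError above.
SYMBOL = "_ZN2at6detail20computeStorageNbytesEN3c108ArrayRefIlEES3_mm"


def demangle(symbol: str) -> Optional[str]:
    """Return the demangled C++ name, or None if c++filt is not installed."""
    if shutil.which("c++filt") is None:
        return None
    result = subprocess.run(
        ["c++filt", symbol], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(demangle(SYMBOL))
```

If the installed PyTorch libraries export this function with a different parameter list, the loader cannot resolve the symbol. That would suggest the fbgemm-gpu wheel was built against a different PyTorch version than the one installed, though this is an assumption about the cause and not something confirmed in this thread.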

Jun 07 '22 05:06 chongxiaoc

FBGEMM has a dependency on PyTorch installation. Have you installed PyTorch already?

import torch
import fbgemm_gpu
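
The import order matters because importing torch first loads the PyTorch shared libraries into the process before the fbgemm_gpu extension is opened. A minimal sketch of a check that separates "torch is missing" from "the extension fails to load" is below. The helper name `try_import_after_torch` is hypothetical; it only relies on the standard-library `importlib`.

```python
import importlib
from typing import Tuple


def try_import_after_torch(module_name: str = "fbgemm_gpu") -> Tuple[bool, str]:
    """Import torch first, then the extension module; return (ok, message)."""
    try:
        importlib.import_module("torch")
    except ImportError as exc:
        return False, "torch is not importable: {}".format(exc)
    try:
        importlib.import_module(module_name)
    except (ImportError, OSError) as exc:
        # An undefined-symbol failure surfaces as OSError from ctypes.
        return False, "{} failed to load: {}".format(module_name, exc)
    return True, "{} loaded".format(module_name)


if __name__ == "__main__":
    ok, message = try_import_after_torch()
    print(ok, message)
```

If the second step fails with an undefined-symbol OSError even though torch imports cleanly, the problem is probably a version mismatch between the two packages and not a missing PyTorch installation.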

Jun 07 '22 05:06 jianyuh

@jianyuh I installed torch 1.11 + cu113 already.

Jun 07 '22 05:06 chongxiaoc

Details:

>>> torch.__version__
'1.11.0+cu113'
>>> torch.cuda.is_available()
True
>>> import fbgemm_gpu
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.7/site-packages/fbgemm_gpu/__init__.py", line 12, in <module>
    torch.ops.load_library(os.path.join(os.path.dirname(__file__), "fbgemm_gpu_py.so"))
  File "/usr/lib/python3.7/site-packages/torch/_ops.py", line 220, in load_library
    ctypes.CDLL(path)
  File "/usr/lib/python3.7/ctypes/__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /usr/lib/python3.7/site-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZN2at6detail20computeStorageNbytesEN3c108ArrayRefIlEES3_mm

Jun 07 '22 06:06 chongxiaoc

I also tried installing from source, following the guide on the homepage. The build succeeds; for example:

[ 98%] Building CXX object bench/CMakeFiles/I8SpmdmBenchmark.dir/I8SpmdmBenchmark.cc.o
[ 99%] Building CXX object bench/CMakeFiles/I8SpmdmBenchmark.dir/BenchUtils.cc.o
[ 99%] Building CXX object bench/CMakeFiles/I8SpmdmBenchmark.dir/__/test/QuantizationHelpers.cc.o
[ 99%] Building CXX object bench/CMakeFiles/I8SpmdmBenchmark.dir/__/test/EmbeddingSpMDMTestUtils.cc.o
[ 99%] Linking CXX executable I8SpmdmBenchmark
make[2]: Leaving directory '/root/FBGEMM/build'
[ 99%] Built target I8SpmdmBenchmark
make[2]: Entering directory '/root/FBGEMM/build'
Scanning dependencies of target PackedFloatInOutBenchmark
make[2]: Leaving directory '/root/FBGEMM/build'
make[2]: Entering directory '/root/FBGEMM/build'
[ 99%] Building CXX object bench/CMakeFiles/PackedFloatInOutBenchmark.dir/PackedFloatInOutBenchmark.cc.o
[ 99%] Building CXX object bench/CMakeFiles/PackedFloatInOutBenchmark.dir/BenchUtils.cc.o
[100%] Building CXX object bench/CMakeFiles/PackedFloatInOutBenchmark.dir/__/test/QuantizationHelpers.cc.o
[100%] Building CXX object bench/CMakeFiles/PackedFloatInOutBenchmark.dir/__/test/EmbeddingSpMDMTestUtils.cc.o
[100%] Linking CXX executable PackedFloatInOutBenchmark
make[2]: Leaving directory '/root/FBGEMM/build'
[100%] Built target PackedFloatInOutBenchmark
make[1]: Leaving directory '/root/FBGEMM/build'
make: Leaving directory '/root/FBGEMM/build'

But I still get the same import error. Running ldd on fbgemm_gpu_py.so shows that several shared-library dependencies are not found:

[root@~/FBGEMM #]ldd /usr/lib/python3.7/site-packages/fbgemm_gpu/fbgemm_gpu_py.so
	linux-vdso.so.1 (0x00007ffdbaebb000)
	libtorch.so => not found
	libc10.so => not found
	libcuda.so.1 => /usr/local/nvidia/lib64/libcuda.so.1 (0x00007fce523ab000)
	libnvrtc.so.11.2 => /usr/local/cuda/lib64/libnvrtc.so.11.2 (0x00007fce4f676000)
	libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x00007fce4f46b000)
	libcudart.so.11.0 => /usr/local/cuda/lib64/libcudart.so.11.0 (0x00007fce4f1d2000)
	libc10_cuda.so => not found
	libnvidia-ml.so.1 => /usr/local/nvidia/lib64/libnvidia-ml.so.1 (0x00007fce4eb3a000)
	libtorch_cuda.so => not found
	libtorch_cuda_cpp.so => not found
	libtorch_cpu.so => not found
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fce4eb17000)
	libcublas.so.11 => /usr/local/cuda/lib64/libcublas.so.11 (0x00007fce47ad7000)
	libtorch_cuda_cu.so => not found
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fce47acd000)
	libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fce47949000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fce477c4000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fce477aa000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fce475ea000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fce6f32a000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fce475e5000)
	libcublasLt.so.11 => /usr/local/cuda/lib64/libcublasLt.so.11 (0x00007fce3ba07000)
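
To pull the unresolved entries out of output like the above, a small parser is enough. This is a minimal sketch; the function name `missing_libraries` is hypothetical and the sample text is a subset of the ldd output shown here.

```python
from typing import List

# A few lines copied from the ldd output above.
SAMPLE = """\
    linux-vdso.so.1 (0x00007ffdbaebb000)
    libtorch.so => not found
    libc10.so => not found
    libcuda.so.1 => /usr/local/nvidia/lib64/libcuda.so.1 (0x00007fce523ab000)
"""


def missing_libraries(ldd_output: str) -> List[str]:
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        name, sep, target = line.strip().partition(" => ")
        if sep and target.strip() == "not found":
            missing.append(name)
    return missing


if __name__ == "__main__":
    print(missing_libraries(SAMPLE))  # ['libtorch.so', 'libc10.so']
```

Note that libtorch.so, libc10.so and the related libraries ship inside the torch package's own lib directory, which is not on the default loader search path. A "not found" from ldd for those names may therefore not be the cause of the failure, since they are normally already loaded once torch has been imported in the same process.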

GPU: RTX5000. Driver Version: 470.63.01

Jun 07 '22 18:06 chongxiaoc

Hi @chongxiaoc, thanks for sharing! It is a known issue we are working on! As a workaround please build fbgemm_gpu from source.

Jun 23 '22 15:06 geyyer

@chongxiaoc is this resolved for you?

Jul 25 '22 23:07 colin2328

@colin2328 I think we can close it for now. It looks like the RTX5000 (compute capability 7.5) is not supported at a low level.

Jul 25 '22 23:07 chongxiaoc

Closing this issue, as FBGEMM_GPU builds have substantially changed over the last few months. Please feel free to file a new issue if you run into installation issues.

Apr 28 '23 18:04 q10
