
unable to import fbgemm_gpu

Open · vkuzo opened this issue 4 months ago • 10 comments

Hi folks! When I install fbgemm-gpu-genai from pip, I am unable to import the library. Neither the stable nor the nightly version works for me. Repro:

(pytorch) [[email protected] ~/local/ao (20250821_float8_tensor_fix)]$ with-proxy pip install --pre fbgemm-gpu-genai --index-url https://download.pytorch.org/whl/nightly/cu128/
Looking in indexes: https://download.pytorch.org/whl/nightly/cu128/
Collecting fbgemm-gpu-genai
  Using cached https://download.pytorch.org/whl/nightly/cu128/fbgemm_gpu_genai-2025.8.20%2Bcu128-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (2.7 kB)
Requirement already satisfied: numpy in /home/vasiliy/.conda/envs/pytorch/lib/python3.11/site-packages (from fbgemm-gpu-genai) (2.2.3)
Using cached https://download.pytorch.org/whl/nightly/cu128/fbgemm_gpu_genai-2025.8.20%2Bcu128-cp311-cp311-manylinux_2_28_x86_64.whl (32.2 MB)
Installing collected packages: fbgemm-gpu-genai
Successfully installed fbgemm-gpu-genai-2025.8.20+cu128
(pytorch) [[email protected] ~/local/ao (20250821_float8_tensor_fix)]$ python
Python 3.11.0 (main, Mar  1 2023, 18:26:19) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import fbgemm_gpu
ERROR:root:Could not load the library 'experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so'!


Could not load this library: /home/vasiliy/.conda/envs/pytorch/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so



Traceback (most recent call last):
  File "/data/users/vasiliy/pytorch/torch/_ops.py", line 1487, in load_library
    ctypes.CDLL(path)
  File "/home/vasiliy/.conda/envs/pytorch/lib/python3.11/ctypes/__init__.py", line 376, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: /home/vasiliy/.conda/envs/pytorch/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so: undefined symbol: _ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE15_M_replace_coldEPcmPKcmm

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vasiliy/.conda/envs/pytorch/lib/python3.11/site-packages/fbgemm_gpu/__init__.py", line 90, in <module>
    _load_library(f"{library}.so", __variant__ == "docs")
  File "/home/vasiliy/.conda/envs/pytorch/lib/python3.11/site-packages/fbgemm_gpu/__init__.py", line 22, in _load_library
    raise error
  File "/home/vasiliy/.conda/envs/pytorch/lib/python3.11/site-packages/fbgemm_gpu/__init__.py", line 17, in _load_library
    torch.ops.load_library(os.path.join(os.path.dirname(__file__), filename))
  File "/data/users/vasiliy/pytorch/torch/_ops.py", line 1489, in load_library
    raise OSError(f"Could not load this library: {path}") from e
OSError: Could not load this library: /home/vasiliy/.conda/envs/pytorch/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so
>>> 

I am on an H100 with PyTorch built from source.

vkuzo avatar Aug 21 '25 11:08 vkuzo

I also repro on the same machine with PyTorch version '2.7.1+cu128'.

vkuzo avatar Aug 21 '25 11:08 vkuzo

The missing symbol _ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE15_M_replace_coldEPcmPKcmm appears to indicate that the libstdc++ (GLIBCXX) installed on the system is too old. Could you try this on a more recent OS version, i.e. one with glibc >= 2.28 installed?
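
As a quick diagnostic, a minimal sketch along these lines (plain ctypes, nothing fbgemm-specific) can check whether the libstdc++ your Python process actually loads exports that symbol:

import ctypes

# Load whichever libstdc++ the dynamic loader resolves for this process;
# the fbgemm_gpu extension will be resolved against the same library.
libstdcxx = ctypes.CDLL("libstdc++.so.6")
symbol = "_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE15_M_replace_coldEPcmPKcmm"
try:
    getattr(libstdcxx, symbol)
    print("symbol found: libstdc++ looks new enough")
except AttributeError:
    print("symbol missing: this libstdc++ predates the one the wheel needs")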

q10 avatar Aug 21 '25 21:08 q10

I have glibc 2.34:

(pytorch) [[email protected] ~/local/ao (20250821_float8_tensor_fix)]$ getconf GNU_LIBC_VERSION
glibc 2.34

vkuzo avatar Aug 22 '25 11:08 vkuzo

here is some more info about how I compiled PyTorch:

(pytorch) [[email protected] ~/local/pytorch (main)]$ gcc --version
gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-9)
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

(pytorch) [[email protected] ~/local/pytorch (main)]$ python -c "import torch; print(torch.__config__.show())" | grep -i cxx
  - Build settings: BUILD_TYPE=Release, COMMIT_SHA=49ff884b1edc3b872eeb2387ec60ef230cae7f24, CUDA_VERSION=12.6, CUDNN_VERSION=9.8.0, CXX_COMPILER=/usr/lib64/ccache/c++, CXX_FLAGS= -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -DC10_NODEPRECATED -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -faligned-new -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.9.0, USE_CUDA=1, USE_CUDNN=ON, USE_CUSPARSELT=OFF, USE_EIGEN_FOR_BLAS=ON, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=OFF, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, USE_XCCL=OFF, USE_XPU=OFF, 
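
Since the undefined symbol above lives in the __cxx11 namespace, it may also be worth confirming the C++ ABI flag of the local torch build; this uses a standard torch helper, shown purely as a diagnostic sketch:

import torch

# True means this torch build uses the new (CXX11) libstdc++ string/list ABI,
# which the wheel here appears to expect (the undefined symbol above is a __cxx11 one).
print(torch.compiled_with_cxx11_abi())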

vkuzo avatar Aug 22 '25 12:08 vkuzo

we investigated this a bit more in preparation for the torchao v0.13.0 release, and here is what we observe:

  • PyTorch 2.7.1 and 2.8 + fbgemm_gpu stable -> import fbgemm_gpu works
  • recent PyTorch nightly + fbgemm_gpu nightly -> import fbgemm_gpu works
  • PyTorch 2.7.1 and 2.8 + fbgemm_gpu nightly -> import fbgemm_gpu leads to Aborted (core dumped)
  • recent PyTorch nightly + fbgemm_gpu stable -> import fbgemm_gpu leads to segmentation fault

is this a KP? Is there any issue we can follow to understand more?

vkuzo avatar Aug 27 '25 14:08 vkuzo

Yes, this observation is expected and in line with how the nightlies and stable releases are intended to work (a rough version-check sketch follows the list below):

  • PyTorch 2.7.1 and 2.8 + fbgemm_gpu stable -> This works because fbgemm_gpu stable is built against torch stable, i.e. fbgemm 1.3.0 is built against torch 2.8.

  • Recent PyTorch nightly + fbgemm_gpu nightly -> This also works because the latest fbgemm nightly is built against the latest torch nightly, i.e. fbgemm_gpu nightly.2025.08.25 works with torch nightly.2025.08.25. It also means that an old fbgemm_gpu nightly will likely not work with a newer torch nightly, and vice versa.

  • PyTorch 2.7.1 and 2.8 + fbgemm_gpu nightly -> This fails as expected, because the nightly depends on the ABI from torch nightly, which is ahead of the torch stable releases.

  • Recent PyTorch nightly + fbgemm_gpu stable -> Likewise, fbgemm_gpu 1.3.0 stable relies on the ABI from torch 2.8 stable, whereas the nightly is ahead.
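
A rough pre-import sanity check along these lines (my own sketch, assuming the version conventions above, not an fbgemm_gpu API) can warn when the installed builds are likely mismatched:

import importlib.metadata as md
import torch

torch_version = torch.__version__                 # e.g. "2.8.0+cu128" or "2.9.0.dev20250825+cu128"
fbgemm_version = md.version("fbgemm-gpu-genai")   # e.g. "1.3.0" or "2025.8.20+cu128"

# Nightly fbgemm wheels use date-style versions (YYYY.M.D); torch nightlies carry ".dev".
fbgemm_is_nightly = fbgemm_version.split(".")[0].startswith("20")
torch_is_nightly = "dev" in torch_version

print(f"torch={torch_version}, fbgemm-gpu-genai={fbgemm_version}")
if fbgemm_is_nightly != torch_is_nightly:
    print("warning: mixing nightly and stable builds of torch and fbgemm_gpu "
          "is expected to fail at import time due to ABI differences")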

q10 avatar Aug 27 '25 17:08 q10

According to https://docs.pytorch.org/FBGEMM/general/Releases.html, I use PyTorch 2.9.1, Python 3.13.7, CUDA 13.0.2, and fbgemm-gpu-genai 1.4.1; however, the import still fails:

In [1]:     from fbgemm_gpu.experimental.gen_ai.quantize import int4_row_quantize_zp, pack_int4
[11/14/25 20:53:42] ERROR    Could not load the library 'experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so'!                                                                     __init__.py:104


                             Could not load this library: /home/wzy/.local/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so

A more detailed repro:

In [4]:  torch.ops.load_library('/home/wzy/.local/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.13/site-packages/torch/_ops.py", line 1490, in load_library
    raise OSError(f"Could not load this library: {path}") from e
OSError: Could not load this library: /home/wzy/.local/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so

Freed-Wu avatar Nov 14 '25 13:11 Freed-Wu

What is the directory you are currently in? If you're in the project root directory, it may cause Python import confusion.
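
One quick way to see which copy of the package Python would pick up (standard importlib machinery, just as a check):

import importlib.util

# If this points into the FBGEMM source tree rather than site-packages,
# the in-tree package is shadowing the installed wheel.
spec = importlib.util.find_spec("fbgemm_gpu")
print(spec.origin if spec else "fbgemm_gpu not found on sys.path")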

q10 avatar Nov 15 '25 06:11 q10

What is the directory you are currently in?

/dev/shm, not the project directory.

Freed-Wu avatar Nov 15 '25 14:11 Freed-Wu

Could you paste the full log of the error? The underlying error should have been printed out as well.
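
If the surrounding log lines are hard to recover, a minimal ctypes sketch like the following (using the path from your report) should surface the loader's underlying error directly:

import ctypes

import torch  # load torch first so its shared libraries are already in the process

# dlopen the extension directly so the loader's own message
# (e.g. "undefined symbol: ...") is printed, rather than only the
# generic OSError that torch.ops.load_library re-raises.
so_path = "/home/wzy/.local/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so"
try:
    ctypes.CDLL(so_path)
    print("loaded OK")
except OSError as e:
    print("dlopen failed:", e)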

q10 avatar Nov 15 '25 23:11 q10