torch-ccl
Missing oneCCL libs in 1.13.100+gpu
Hi! I've installed oneccl_bind_pt==1.13.100+gpu from https://developer.intel.com/ipex-whl-stable-xpu, but importing the package fails with a "libccl.so.1: cannot open shared object file" error:
$ python
Python 3.10.4 (main, Oct 26 2022, 02:21:10) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import oneccl_bindings_for_pytorch
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/gpfs/home5/robertsc/2D-VQ-AE-2/.venv/py310-XPU/lib/python3.10/site-packages/oneccl_bindings_for_pytorch/__init__.py", line 26, in <module>
from . import _C as ccl_lib
ImportError: libccl.so.1: cannot open shared object file: No such file or directory
It seems like including oneCCL was forgotten in the latest build, because when I check a previous version (1.13.0+cpu), libccl.so.1 is included in oneccl_bindings_for_pytorch:
$ grep -r libccl.so.1
Binary file lib/python3.10/site-packages/oneccl_bindings_for_pytorch/lib/libccl.so.1.0 matches
Binary file lib/python3.10/site-packages/oneccl_bindings_for_pytorch/lib/libccl.so.1 matches
Binary file lib/python3.10/site-packages/oneccl_bindings_for_pytorch/lib/libccl.so matches
Binary file lib/python3.10/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch.so matches
lib/python3.10/site-packages/oneccl_bind_pt-1.13.0+cpu.dist-info/RECORD:oneccl_bindings_for_pytorch/lib/libccl.so.1,sha256=QsFq3umZ-WRQHD69SAZ9ilXdYcEwwZfBVS4b8P48KjQ,4544872
lib/python3.10/site-packages/oneccl_bind_pt-1.13.0+cpu.dist-info/RECORD:oneccl_bindings_for_pytorch/lib/libccl.so.1.0,sha256=QsFq3umZ-WRQHD69SAZ9ilXdYcEwwZfBVS4b8P48KjQ,4544872
But in the 1.13.100+gpu version it's missing:
$ grep -r libccl.so.1
Binary file lib/python3.10/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch_xpu.so matches
Binary file lib/python3.10/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch.so matches
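The same check can be scripted so it is easy to repeat across environments. The following is a minimal sketch, not part of the package: `bundled_libs` and `package_dir` are hypothetical helper names, and it assumes the bundled libraries live in the package's `lib` subdirectory, as in the 1.13.0+cpu listing above. It locates the package with `importlib.util.find_spec`, which does not run the package's `__init__.py`, so the failing import is not triggered.

```python
import importlib.util
from pathlib import Path


def bundled_libs(package_dir, pattern="libccl.so*"):
    """Return sorted names of files matching `pattern` in <package_dir>/lib."""
    lib_dir = Path(package_dir) / "lib"
    if not lib_dir.is_dir():
        return []
    return sorted(p.name for p in lib_dir.glob(pattern))


def package_dir(name="oneccl_bindings_for_pytorch"):
    """Locate a top-level package directory without importing the package."""
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


if __name__ == "__main__":
    pkg = package_dir()
    if pkg is None:
        print("oneccl_bindings_for_pytorch is not installed in this environment")
    else:
        found = bundled_libs(pkg)
        print(f"{pkg / 'lib'}: {found or 'no libccl.so* files bundled'}")
```

An empty result for the 1.13.100+gpu install would match the grep output above, where only the two binding libraries reference libccl.so.1.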
As a temporary fix I can install oneccl-devel==2021.8.0 from PyPI, which still bundles it:
$ grep -r libccl.so.1
Binary file lib/cpu_gpu_dpcpp/libccl.so.1.0 matches
Binary file lib/cpu_gpu_dpcpp/libccl.so.1 matches
Binary file lib/cpu_gpu_dpcpp/libccl.so matches
Binary file lib/python3.10/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch_xpu.so matches
Binary file lib/python3.10/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch.so matches
lib/python3.10/site-packages/oneccl_devel-2021.8.0.dist-info/RECORD:../../cpu/libccl.so.1,sha256=Mb1k7Cr0EMbtwcPLheTP5ipnzpMYizaUkqVlKC7SJ-s,4847184
lib/python3.10/site-packages/oneccl_devel-2021.8.0.dist-info/RECORD:../../cpu/libccl.so.1.0,sha256=Mb1k7Cr0EMbtwcPLheTP5ipnzpMYizaUkqVlKC7SJ-s,4847184
lib/python3.10/site-packages/oneccl_devel-2021.8.0.dist-info/RECORD:../../cpu_gpu_dpcpp/libccl.so.1,sha256=bYQ16wi5o1aOEmM-x3n2G1-3GVXjVzDsL15XpNRu5u0,7543928
lib/python3.10/site-packages/oneccl_devel-2021.8.0.dist-info/RECORD:../../cpu_gpu_dpcpp/libccl.so.1.0,sha256=bYQ16wi5o1aOEmM-x3n2G1-3GVXjVzDsL15XpNRu5u0,7543928
Binary file lib/cpu/libccl.so.1.0 matches
Binary file lib/cpu/libccl.so.1 matches
Binary file lib/cpu/libccl.so matches
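Note that oneccl-devel places the libraries under lib/cpu and lib/cpu_gpu_dpcpp at the environment root rather than inside the package directory. If the dynamic loader still cannot find libccl.so.1 after installing it, one option is to add those directories to LD_LIBRARY_PATH. A minimal sketch follows, under two assumptions: the loader does not already search these directories, and the GPU build wants the cpu_gpu_dpcpp variant (neither is confirmed here). `find_lib_dirs` and `ld_library_path_line` are hypothetical helper names.

```python
import sys
from pathlib import Path


def find_lib_dirs(env_root, soname="libccl.so.1"):
    """Return directories one level below <env_root>/lib that contain `soname`."""
    return sorted(p.parent for p in (Path(env_root) / "lib").glob(f"*/{soname}"))


def ld_library_path_line(dirs):
    """Build a shell `export` line that prepends `dirs` to LD_LIBRARY_PATH."""
    joined = ":".join(str(d) for d in dirs)
    # ${VAR:+:$VAR} appends the old value only if it is set and non-empty.
    return 'export LD_LIBRARY_PATH="' + joined + '${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"'


if __name__ == "__main__":
    dirs = find_lib_dirs(sys.prefix)
    # Assumption: prefer the cpu_gpu_dpcpp variant for the XPU build, so list it first.
    dirs.sort(key=lambda d: d.name != "cpu_gpu_dpcpp")
    if dirs:
        print(ld_library_path_line(dirs))
    else:
        print("libccl.so.1 not found under " + sys.prefix)
```

Run it with the environment's own interpreter and evaluate the printed line in the shell before starting Python again.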
Shipping with oneCCL is the default build option, so perhaps that flag was set incorrectly by accident when the latest version was built? Could you please rebuild with the latest oneCCL version? :)
Edit: the same applies to Intel MPI; its libraries and binaries are also missing.