
Missing oneCCL libs in 1.13.100+gpu

Open robogast opened this issue 1 year ago • 1 comments

Hi! I've installed oneccl_bindings_for_pt==1.13.100+gpu from https://developer.intel.com/ipex-whl-stable-xpu, but after installing I get a "libccl.so.1 not found" error:

$ python
Python 3.10.4 (main, Oct 26 2022, 02:21:10) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import oneccl_bindings_for_pytorch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/gpfs/home5/robertsc/2D-VQ-AE-2/.venv/py310-XPU/lib/python3.10/site-packages/oneccl_bindings_for_pytorch/__init__.py", line 26, in <module>
    from . import _C as ccl_lib
ImportError: libccl.so.1: cannot open shared object file: No such file or directory

It looks like oneCCL was left out of the latest build: when I check a previous version (1.13.0+cpu), libccl.so.1 is included in oneccl_bindings_for_pytorch:

$ grep -r libccl.so.1
Binary file lib/python3.10/site-packages/oneccl_bindings_for_pytorch/lib/libccl.so.1.0 matches
Binary file lib/python3.10/site-packages/oneccl_bindings_for_pytorch/lib/libccl.so.1 matches
Binary file lib/python3.10/site-packages/oneccl_bindings_for_pytorch/lib/libccl.so matches
Binary file lib/python3.10/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch.so matches
lib/python3.10/site-packages/oneccl_bind_pt-1.13.0+cpu.dist-info/RECORD:oneccl_bindings_for_pytorch/lib/libccl.so.1,sha256=QsFq3umZ-WRQHD69SAZ9ilXdYcEwwZfBVS4b8P48KjQ,4544872
lib/python3.10/site-packages/oneccl_bind_pt-1.13.0+cpu.dist-info/RECORD:oneccl_bindings_for_pytorch/lib/libccl.so.1.0,sha256=QsFq3umZ-WRQHD69SAZ9ilXdYcEwwZfBVS4b8P48KjQ,4544872

But in the 1.13.100+gpu version it's missing:

$ grep -r libccl.so.1
Binary file lib/python3.10/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch_xpu.so matches
Binary file lib/python3.10/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch.so matches
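
For anyone who wants to repeat this check without grep, here is a rough Python sketch. It locates the installed package directory without importing it (the import is what fails) and lists any files named libccl.so* inside it. The helper names `package_dir` and `find_bundled_libs` are made up for this example, and the check matches file names only, whereas grep above matches file contents.

```python
import importlib.util
from pathlib import Path


def package_dir(name):
    """Return the install directory of a top-level package, or None.

    find_spec on a top-level name does not execute the package's
    __init__.py, so this works even when the import itself would fail.
    """
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


def find_bundled_libs(root, pattern="libccl.so*"):
    """List files under root whose name matches pattern (relative paths)."""
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob(pattern)
        if p.is_file()
    )


# Example usage (hypothetical; the output depends on the installed wheel):
#   pkg = package_dir("oneccl_bindings_for_pytorch")
#   print(find_bundled_libs(pkg) if pkg else "package not installed")
```

An empty list for the 1.13.100+gpu wheel would match the grep result above, where only the binding libraries reference libccl.so.1.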

As a temporary fix I can install oneccl-devel==2021.8.0 from PyPI, which still bundles it:

$ grep -r libccl.so.1
Binary file lib/cpu_gpu_dpcpp/libccl.so.1.0 matches
Binary file lib/cpu_gpu_dpcpp/libccl.so.1 matches
Binary file lib/cpu_gpu_dpcpp/libccl.so matches
Binary file lib/python3.10/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch_xpu.so matches
Binary file lib/python3.10/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch.so matches
lib/python3.10/site-packages/oneccl_devel-2021.8.0.dist-info/RECORD:../../cpu/libccl.so.1,sha256=Mb1k7Cr0EMbtwcPLheTP5ipnzpMYizaUkqVlKC7SJ-s,4847184
lib/python3.10/site-packages/oneccl_devel-2021.8.0.dist-info/RECORD:../../cpu/libccl.so.1.0,sha256=Mb1k7Cr0EMbtwcPLheTP5ipnzpMYizaUkqVlKC7SJ-s,4847184
lib/python3.10/site-packages/oneccl_devel-2021.8.0.dist-info/RECORD:../../cpu_gpu_dpcpp/libccl.so.1,sha256=bYQ16wi5o1aOEmM-x3n2G1-3GVXjVzDsL15XpNRu5u0,7543928
lib/python3.10/site-packages/oneccl_devel-2021.8.0.dist-info/RECORD:../../cpu_gpu_dpcpp/libccl.so.1.0,sha256=bYQ16wi5o1aOEmM-x3n2G1-3GVXjVzDsL15XpNRu5u0,7543928
Binary file lib/cpu/libccl.so.1.0 matches
Binary file lib/cpu/libccl.so.1 matches
Binary file lib/cpu/libccl.so matches
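
Note that oneccl-devel puts the library under lib/cpu and lib/cpu_gpu_dpcpp in the environment root, not next to the bindings, so the dynamic loader may still need help finding it. One option is to add that directory to LD_LIBRARY_PATH; another is to preload the library before the import, sketched below. This is an assumption-laden sketch and not something confirmed by the maintainers: `find_libccl` and `preload_libccl` are hypothetical helper names, the directory layout is taken from the grep output above, and preloading may still fail if libccl.so.1 itself depends on libraries that are not on the loader path (Intel MPI, for example).

```python
import ctypes
import sys
from pathlib import Path


def find_libccl(prefix=sys.prefix, variant="cpu_gpu_dpcpp"):
    """Return <prefix>/lib/<variant>/libccl.so.1 if it exists, else None.

    The lib/<variant> layout is assumed from the oneccl-devel 2021.8.0
    file listing; other versions may place the library elsewhere.
    """
    candidate = Path(prefix) / "lib" / variant / "libccl.so.1"
    return candidate if candidate.exists() else None


def preload_libccl(path):
    """Load libccl by full path with RTLD_GLOBAL.

    Once a library with this soname is loaded, a later import that needs
    libccl.so.1 can resolve against it instead of searching the loader path.
    """
    return ctypes.CDLL(str(path), mode=ctypes.RTLD_GLOBAL)


# Example usage (hypothetical; run before importing the bindings):
#   lib = find_libccl()
#   if lib is not None:
#       preload_libccl(lib)
#   import oneccl_bindings_for_pytorch
```

The cpu_gpu_dpcpp variant is used as the default here because the failing wheel is the +gpu build; pass variant="cpu" for the CPU-only library.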

The default build option is to ship with oneCCL, so perhaps this flag was accidentally set incorrectly while building the latest version? Could you please rebuild with the latest oneCCL version? :)

Edit: the same goes for Intel MPI; its libs and bins are also missing.

robogast, Feb 27 '23 12:02