Problem in building Alpa-modified Jaxlib.

Open Fonsifa opened this issue 1 year ago • 5 comments

Please describe the bug

Please describe the expected behavior

System information and environment

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04, docker):
  • Python version: 3.9
  • CUDA version: 11.3
  • NCCL version: 8.2.0.53
  • cupy version: cupy-cuda11x 12.2.0
  • GPU: GeForce RTX 3090
  • Alpa version: 0.2.3
  • JAX version: 0.3.22

To Reproduce Steps to reproduce the behavior: When I try to install Alpa from source and run python3 build/build.py --enable_cuda --dev_install --bazel_options=--override_repository=org_tensorflow=$(pwd)/../third_party/tensorflow-alpa, several warnings appear. I don't know whether they are related to the error shown in the second screenshot.

Screenshots If applicable, add screenshots to help explain your problem. [screenshots: "image", "troubleshoot"]

Code snippet to reproduce the problem

Additional information Add any other context about the problem here or include any logs that would be helpful to diagnose the problem.

Fonsifa avatar Sep 23 '23 09:09 Fonsifa

This bug is caused by the wrong version of libnccl. I solved it by reinstalling the correct libnccl version and creating a new Python environment based on it.

Lssyes avatar Oct 14 '23 01:10 Lssyes

This bug is caused by the wrong version of libnccl. I solved it by reinstalling the correct libnccl version and creating a new Python environment based on it.

May I ask which exact versions of Python and libnccl you used? Thanks.

Fonsifa avatar Oct 19 '23 06:10 Fonsifa

Yeah: python == 3.8.13, gcc == 7.5.0, nccl == libnccl.so.2.8.4

Lssyes avatar Oct 19 '23 08:10 Lssyes
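
For anyone comparing their setup against the versions above, here is a minimal sketch for listing which libnccl builds are present on a machine. It only reads version numbers from file names such as libnccl.so.2.8.4. The search directories and the helper names (parse_nccl_version, find_nccl_versions) are assumptions for illustration and are not part of Alpa or NCCL, so adjust the paths to your own installation.

```python
import glob
import os
import re

# Assumed locations where libnccl is commonly installed; adjust for your system.
SEARCH_DIRS = [
    "/usr/lib/x86_64-linux-gnu",
    "/usr/local/cuda/lib64",
    "/usr/lib64",
]


def parse_nccl_version(filename):
    """Return (major, minor, patch) for a name like 'libnccl.so.2.8.4', else None."""
    name = os.path.basename(filename)
    match = re.fullmatch(r"libnccl\.so\.(\d+)\.(\d+)\.(\d+)", name)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def find_nccl_versions(search_dirs=SEARCH_DIRS):
    """Map each fully versioned libnccl file found to its parsed version tuple."""
    found = {}
    for directory in search_dirs:
        for path in glob.glob(os.path.join(directory, "libnccl.so.*")):
            version = parse_nccl_version(path)
            if version is not None:
                found[path] = version
    return found


if __name__ == "__main__":
    # Print every fully versioned libnccl file and its version, one per line.
    for path, version in sorted(find_nccl_versions().items()):
        print(path, ".".join(str(part) for part in version))
```

Unversioned names and short symlinks such as libnccl.so.2 are skipped on purpose, since they do not carry the full version in the file name. If more than one version is listed, the build may be picking up a different one than expected.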

Hi, I am running into the same issue when building from source. I don't understand how the libnccl version affects the file-not-found error. Is there any other solution to this?

ertza avatar Nov 22 '23 05:11 ertza

Hi, I am running into the same issue when building from source. I don't understand how the libnccl version affects the file-not-found error. Is there any other solution to this?

The mirror URL is written in one of the workspace files. It seems the file-not-found problem is not the actual cause of the error; the incorrect libnccl version is the main cause.

Fonsifa avatar Nov 22 '23 06:11 Fonsifa