failure to initialize NCCL

Open metaphorz opened this issue 3 years ago • 3 comments

I think that NCCL is part of PyTorch? I am running Python 3.9 so I had to install torch using

-c=conda-forge

as specified in the instructions for installing torch. It seemed to install correctly.

........

jukebox paul$ python3 jukebox/sample.py --model=5b_lyrics --name=sample_5b --levels=3 --sample_length_in_seconds=20 --total_sample_length_in_seconds=180 --sr=44100 --n_samples=6 --hop_fraction=0.5,0.5,0.125
Caught error during NCCL init (attempt 0 of 5): Distributed package doesn't have NCCL built in
Caught error during NCCL init (attempt 1 of 5): Distributed package doesn't have NCCL built in
Caught error during NCCL init (attempt 2 of 5): Distributed package doesn't have NCCL built in
Caught error during NCCL init (attempt 3 of 5): Distributed package doesn't have NCCL built in
Caught error during NCCL init (attempt 4 of 5): Distributed package doesn't have NCCL built in
Traceback (most recent call last):
  File "jukebox/sample.py", line 279, in <module>
    fire.Fire(run)

...bunch more stuff....

raise RuntimeError("Failed to initialize NCCL")

RuntimeError: Failed to initialize NCCL
(jukebox) host:jukebox paul$

metaphorz avatar Mar 18 '21 17:03 metaphorz

I was able to get past this issue by installing Miniconda with Python 3.7, then changing line 43 in jukebox/utils/dist_utils.py from backend = "nccl" to backend = "gloo".
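
The backend swap described above can also be made conditional instead of hard-coded. The following is a minimal sketch, not the code in jukebox/utils/dist_utils.py: `pick_backend` and `nccl_is_available` are hypothetical helper names, and the only PyTorch calls used are `torch.distributed.is_available()` and `torch.distributed.is_nccl_available()`. It falls back to gloo when the installed PyTorch build has no NCCL support, or when PyTorch cannot be imported at all.

```python
def nccl_is_available() -> bool:
    """Report whether the installed PyTorch build ships with NCCL support.

    Hypothetical helper, not part of jukebox. Returns False when PyTorch
    is missing or was built without the distributed package.
    """
    try:
        import torch.distributed as dist
    except ImportError:
        return False
    # is_available() is checked first: builds without distributed support
    # may not expose is_nccl_available() at all.
    return bool(dist.is_available() and dist.is_nccl_available())


def pick_backend(nccl_available: bool) -> str:
    """Choose "nccl" when it is usable, otherwise fall back to "gloo"."""
    return "nccl" if nccl_available else "gloo"


backend = pick_backend(nccl_is_available())
print(f"distributed backend: {backend}")
```

The resulting string could then be used wherever the backend is currently set to "nccl". As a later comment in this thread notes, changing the backend alone was not enough on every setup.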

nbinobied avatar Sep 11 '21 18:09 nbinobied

Ran into this problem on my Ubuntu 22.04 machine with a 3090 running CUDA 11.7. I solved it with the installation below: I changed the cudatoolkit version to match my system and explicitly installed the oldest PyTorch build with GPU support that matched my CUDA version as closely as possible. I also changed the backend to gloo, as @nbinobied suggested. Changing only the backend didn't work for me.

conda create --name jukebox python=3.7.5
conda activate jukebox
conda install mpi4py=3.0.3 # if this fails, try: pip install mpi4py==3.0.3
conda install pytorch=1.4 torchvision=0.5 cudatoolkit=11.0.221 -c pytorch 
git clone https://github.com/openai/jukebox.git
cd jukebox
pip install -r requirements.txt
pip install -e .
pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu116
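
Because the final pip command installs a different PyTorch version than the earlier conda command, it may help to check which build actually ended up in the environment before running sample.py. A small sketch under that assumption; `describe_torch_build` is a hypothetical helper name, and it only reads `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()`.

```python
def describe_torch_build() -> dict:
    """Collect basic facts about the PyTorch build in the active environment.

    Hypothetical helper, not part of jukebox. All values stay at their
    defaults when PyTorch cannot be imported.
    """
    info = {"torch": None, "cuda_build": None, "gpu_visible": False}
    try:
        import torch
    except ImportError:
        return info
    info["torch"] = torch.__version__          # e.g. a "+cu116" suffix on CUDA wheels
    info["cuda_build"] = torch.version.cuda    # None on CPU-only builds
    info["gpu_visible"] = bool(torch.cuda.is_available())
    return info


for key, value in describe_torch_build().items():
    print(f"{key}: {value}")
```

If `cuda_build` is None or `gpu_visible` is False, the environment is probably not using the CUDA-enabled wheel that the last install line was meant to provide.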

redsphinx avatar Dec 29 '22 01:12 redsphinx

I had this issue on a WSL 2 setup where I followed the README instructions religiously. Using @redsphinx's last line made it work (i.e. it pulled in a version of NCCL that agreed with WSL and didn't clash with other dependencies), with the backend left unchanged:

pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu116

My 3090 Ti has now generated 6 wonderfully weird music samples. All is well.

itsnotlupus avatar Jan 04 '23 02:01 itsnotlupus