Error when training

Open Cybermb opened this issue 2 years ago • 2 comments

I cannot get this to work on Windows.

I am running the following command: python -m vall_e.train yaml=config/test/ar.yml

First, I was getting error RuntimeError: Distributed package doesn't have NCCL built in

It seems the NCCL backend of PyTorch's distributed package is not available on Windows.

I found a workaround that uses the gloo backend instead, and added the following code to data.py:

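import os
import socket

import torch.distributed

def get_free_port():
    # Bind to port 0 so the OS picks an unused port, then release the socket.
    with socket.socket() as sock:
        sock.bind(("", 0))
        return sock.getsockname()[1]

# Single-process settings that torch.distributed reads from the environment.
os.environ["RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = str(get_free_port())
os.environ["LOCAL_RANK"] = "0"

# Use gloo because NCCL is not built into PyTorch on Windows.
torch.distributed.init_process_group(backend="gloo", rank=0, world_size=1)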
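
For anyone who wants the same workaround without hardcoding the backend, here is a minimal sketch that picks gloo everywhere except Linux. The helper names (pick_backend, single_process_env, init_single_process_group) are made up for illustration and are not part of vall-e. It also assumes that a Linux machine has a CUDA GPU, since NCCL needs one.

```python
import os
import socket
import sys


def get_free_port() -> int:
    """Ask the OS for an unused TCP port and close the probe socket."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))  # port 0 lets the OS choose a free port
        return sock.getsockname()[1]


def pick_backend(platform: str = sys.platform) -> str:
    """Return "nccl" on Linux and "gloo" elsewhere (Windows, macOS)."""
    return "nccl" if platform.startswith("linux") else "gloo"


def single_process_env() -> dict:
    """Environment variables describing a one-process 'distributed' run."""
    return {
        "RANK": "0",
        "LOCAL_RANK": "0",
        "WORLD_SIZE": "1",
        "MASTER_ADDR": "localhost",
        "MASTER_PORT": str(get_free_port()),
    }


def init_single_process_group() -> None:
    """Initialise torch.distributed for a single process."""
    # Imported here so the helpers above can be used without torch installed.
    import torch.distributed as dist

    os.environ.update(single_process_env())
    dist.init_process_group(backend=pick_backend(), rank=0, world_size=1)
```

Calling init_single_process_group() once, before any other torch.distributed call, should have the same effect as the snippet above on Windows.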

Then it returns the following error: RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
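
For context, this message is what PyTorch raises when a tensor created with requires_grad=True (a "leaf", such as a model weight) is modified in place while autograd is tracking it. The sketch below only reproduces the message in isolation and shows the usual way around it. It is not a diagnosis of where vall-e triggers it, and the function name is hypothetical.

```python
def inplace_update_demo():
    """Return (error_message, updated_values) for a leaf tensor."""
    import torch  # imported lazily so the sketch can be pasted anywhere

    w = torch.ones(3, requires_grad=True)  # a leaf tensor, like a model weight

    try:
        w.add_(1.0)  # in-place edit while autograd is tracking: not allowed
        message = None
    except RuntimeError as exc:
        message = str(exc)

    # Inside no_grad() autograd does not track the update, so the same
    # in-place operation is accepted.
    with torch.no_grad():
        w.add_(1.0)

    return message, w.detach().tolist()
```

If the traceback points at a line that updates a parameter in place, wrapping that update in torch.no_grad() is the usual first thing to try.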

This is where I hit a brick wall.

Platform: Windows 11
Python: 3.10.9
torch: 1.11.0+cu113

Cybermb avatar Feb 11 '23 18:02 Cybermb

The easiest route would be to use a WSL Ubuntu distro on Windows. You can set that up from PowerShell with wsl --install, and then set up NVIDIA GPU support for WSL (per NVIDIA's WSL guidance, the GPU driver itself is installed on the Windows host rather than as a Linux driver inside the distro).

Another route would be to install Docker and run a Dockerfile that contains all the drivers. There is an open PR that contains a Jupyter image that does all of that. Here is another Dockerfile that uses NVIDIA's CUDA Ubuntu image, which supports WSL:

FROM nvidia/cuda:11.8.0-base-ubuntu22.04

USER root
ENV TORCH_HOME=/data/models

RUN apt update && apt install -y --no-install-recommends \
    build-essential \
    ffmpeg \
    git \
    python3 \
    python3-dev \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

RUN python3 -m pip install git+https://github.com/enhuiz/vall-e.git

VOLUME /data/models
VOLUME /data/shared

ENTRYPOINT ["/bin/bash", "--login", "-c"]

alimaslax avatar Feb 19 '23 15:02 alimaslax

Has anyone tried this on an M1 Mac? I got a similar error to the one the OP described. I will try running this Docker container later this week or next and post the result here.

anuraagdjain avatar Apr 04 '23 19:04 anuraagdjain