Error when training
I cannot get this to work on Windows.
I am running the following command: python -m vall_e.train yaml=config/test/ar.yml
At first I got the error RuntimeError: Distributed package doesn't have NCCL built in
It seems the NCCL backend of PyTorch's distributed package is not available on Windows.
I found a workaround, which is to use the gloo backend, and added the following code in
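    import socket
    import torch.distributed

    def get_free_port():
        # Bind to port 0 so the OS picks an unused port, then report it.
        sock = socket.socket()
        sock.bind(("", 0))
        return sock.getsockname()[1]

    torch.distributed.init_process_group(backend="gloo", rank=0, world_size=1)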
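For anyone trying the same workaround, here is a minimal sketch of how the pieces could fit together. It assumes the default env:// rendezvous, which reads MASTER_ADDR and MASTER_PORT from the environment, and it assumes the free port is meant to be used as MASTER_PORT. pick_backend and init_single_process are hypothetical helper names, not part of vall-e.

```python
import os
import socket
import sys


def get_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket() as sock:
        sock.bind(("", 0))
        return sock.getsockname()[1]


def pick_backend(platform: str, nccl_available: bool) -> str:
    """Hypothetical helper: use NCCL when possible, otherwise fall back to gloo."""
    if platform.startswith("win") or not nccl_available:
        return "gloo"
    return "nccl"


def init_single_process() -> str:
    """Initialise a one-process group via the default env:// rendezvous."""
    # Imported lazily so the helpers above can be used without PyTorch installed.
    import torch.distributed as dist

    # env:// needs these two variables; keep any values the user already set.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", str(get_free_port()))

    backend = pick_backend(sys.platform, dist.is_nccl_available())
    dist.init_process_group(backend=backend, rank=0, world_size=1)
    return backend
```

Calling init_single_process() once before training starts should be enough for a single-GPU run. It returns the name of the backend it chose, which is handy for logging.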
With the gloo backend in place, training then fails with a different error:
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
This is where I hit a brick wall.
Platform: Windows 11, Python: 3.10.9, torch: 1.11.0+cu113
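The "leaf Variable" error is PyTorch's autograd refusing an in-place write to a leaf tensor (for example a model parameter) that has requires_grad=True. Where that happens in the vall-e code cannot be told from the message alone, so the following is only a generic sketch of the error and of one common workaround, which is to wrap the in-place update in torch.no_grad(). reproduce_and_fix is a made-up name for illustration.

```python
def reproduce_and_fix():
    """Trigger the in-place-on-leaf error, then redo the update under no_grad."""
    # Imported here so this file still loads on machines without PyTorch.
    import torch

    # A leaf tensor that requires grad, similar to a model parameter.
    w = torch.ones(3, requires_grad=True)

    try:
        w.add_(1.0)  # in-place update that autograd would have to track
        raised = False
    except RuntimeError:
        raised = True  # autograd rejected the in-place write

    # Inside no_grad the update is not tracked, so it is allowed.
    with torch.no_grad():
        w.add_(1.0)

    return raised, w.tolist()
```

If the failing line turns out to be inside a library rather than in your own code, the same idea applies, but a torch version mismatch is also worth ruling out before patching anything.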
The easiest route is to use a WSL Ubuntu distro on Windows. You can install it from PowerShell:
wsl --install
Then set up the NVIDIA drivers for WSL (note that the GPU driver itself is installed on the Windows host, not inside the distro). Another route is to install Docker and build an image from a Dockerfile that sets up everything needed. There is an open PR that contains a Jupyter image that does all that. Here is another Dockerfile, based on NVIDIA's CUDA Ubuntu image, which works under WSL.
```
FROM nvidia/cuda:11.8.0-base-ubuntu22.04
USER root
ENV TORCH_HOME=/data/models
RUN apt update && apt install -y --no-install-recommends \
        build-essential ffmpeg git python3 python3-dev python3-pip \
    && rm -rf /var/lib/apt/lists/*
RUN python3 -m pip install git+
VOLUME /data/models
VOLUME /data/shared
ENTRYPOINT ["/bin/bash", "--login", "-c"]
```
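Whichever route you take, it may help to check what the environment looks like before starting a training run. The sketch below is only an illustration: looks_like_wsl and environment_report are hypothetical names, and the WSL check relies on the kernel release string containing "microsoft", which is how WSL kernels identify themselves.

```python
import platform
import shutil


def looks_like_wsl(kernel_release: str) -> bool:
    """WSL kernels report a release string that contains 'microsoft'."""
    return "microsoft" in kernel_release.lower()


def environment_report() -> dict:
    """Collect a few facts worth checking before starting training."""
    report = {
        "system": platform.system(),
        "wsl": looks_like_wsl(platform.release()),
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
    }
    try:
        # Optional: only works once PyTorch is installed in this environment.
        import torch

        report["cuda_available"] = torch.cuda.is_available()
    except ImportError:
        report["cuda_available"] = None  # PyTorch not installed yet
    return report
```

Printing environment_report() inside WSL or inside the container gives a quick view of whether the GPU tooling is visible there.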
Has anyone tried this on an M1 Mac? I got a similar error to the one described by the OP. I will try running this Docker container later this week or next and post the result here.