
Guidance or examples on multi-node training

aaronnewsome opened this issue • 2 comments

I've been getting along pretty well training my custom voice for Piper, but I started wondering about using multiple nodes and multiple GPUs. None of my machines can house more than one GPU, so each node in my setup has a single GPU.

I see that PyTorch Lightning supports multi-node/multi-GPU training, and I found this issue:

https://github.com/rhasspy/piper/issues/95

which mentions a --num_nodes flag, but I can't find any documentation on how to set this up for training with Piper.

Can anyone point me to a concise guide on how to train on multiple nodes with Piper?

aaronnewsome commented on Dec 27 '23

OK, I've made some progress. I wouldn't exactly call it success, but after much fiddling I was able to get training running on two nodes. In case someone else comes here looking, at least they'll find this post. Here's what I did. On the first node, I ran this:

NCCL_IB_DISABLE=1 NCCL_SOCKET_IFNAME=br0 MASTER_ADDR=localhost MASTER_PORT=13165 NODE_RANK=0 LOCAL_RANK=0 ./train.sh

Note that NODE_RANK is 0 and the master address points at the node itself (localhost).

On the second node, I ran this:

NCCL_IB_DISABLE=1 NCCL_SOCKET_IFNAME=br0 MASTER_ADDR=192.168.112.27 MASTER_PORT=13165 NODE_RANK=1 LOCAL_RANK=0 ./train.sh

Note that NODE_RANK is 1 and MASTER_ADDR is the first server's IP.
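If you're reproducing this, it may help to sanity-check the networking first, since NCCL_SOCKET_IFNAME and MASTER_ADDR/MASTER_PORT have to line up on both machines. Something like the following, using the br0 interface, IP, and port from my commands above (substitute your own):

# on each node: confirm the interface passed as NCCL_SOCKET_IFNAME exists and has an address
ip -br addr show br0

# from the second node, once node 0 is running: confirm the master port is reachable
nc -zv 192.168.112.27 13165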

Inside train.sh, my piper_train command looks like this:

python3 -m piper_train --num_nodes ${WORLD_SIZE} --accelerator 'gpu' --devices 1 --dataset-dir "${DATASET}" --batch-size 24 --validation-split 0.0 --num-test-examples 0 --max_epochs 4002 --checkpoint-epochs 10 --precision 16 --quality "high" --resume_from_checkpoint ${CHECKPOINT} --strategy ddp

WORLD_SIZE is set to 2, and the other variables are set accordingly. To my surprise, it actually worked. These are pretty slow GPUs, so it isn't exactly any faster, but nvitop shows both GPUs being utilized pretty well.
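For anyone who wants to copy this, here's a rough sketch of what the whole train.sh could look like. The DATASET and CHECKPOINT paths are placeholders, and WORLD_SIZE is hard-coded to the two nodes I used; the NCCL/MASTER/RANK variables are expected to come from the environment set on the command line as shown above:

#!/usr/bin/env bash
# Placeholder paths -- point these at your own dataset directory and checkpoint.
DATASET=/path/to/my-voice-dataset
CHECKPOINT=/path/to/last.ckpt

# Two machines, one GPU each. MASTER_ADDR, MASTER_PORT, NODE_RANK and
# LOCAL_RANK must already be in the environment (set per node on the
# command line that invokes this script).
export WORLD_SIZE=2

python3 -m piper_train \
    --num_nodes ${WORLD_SIZE} \
    --accelerator 'gpu' \
    --devices 1 \
    --dataset-dir "${DATASET}" \
    --batch-size 24 \
    --validation-split 0.0 \
    --num-test-examples 0 \
    --max_epochs 4002 \
    --checkpoint-epochs 10 \
    --precision 16 \
    --quality "high" \
    --resume_from_checkpoint ${CHECKPOINT} \
    --strategy ddp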

I don't have any systems where I can connect more than one GPU, so for my next test I'm going to try this same setup with faster GPUs. Yes, I know I could do all of this on cloud systems, but this is how I learn.

If anyone has advice on the best way to get multi-node training working, I'm all ears.

aaronnewsome commented on Jan 16 '24

@aaronnewsome, thanks! I'm on the lookout for someone to help train a quick English TTS using the training Colab (not for free 🙂). Can we talk?

benthinker commented on Feb 14 '24