Stuck when running on multi-GPU

Open · tryanbot opened this issue 4 years ago · 19 comments

Question

Hi, could someone help me figure out why my training gets stuck when using multiple GPUs?

Additional Context

[screenshots: console output of the multi-GPU run]

Stuck; no training iteration completes (I am running with --iter=1 only).

Here is the .cfg:

--datadir=/home/user/data/audio/
--rundir=/home/user/data/audio/
--archdir=/home/user/dev/wav2letter/tutorials/1-librispeech_clean/
--train=lists/train-clean-100.lst
--valid=lists/dev-clean.lst
--input=flac
--arch=network.arch
--tokens=/home/user/data/audio/am/tokens.txt
--lexicon=/home/user/data/audio/am/lexicon.txt
--criterion=ctc
--lr=0.1
--maxgradnorm=1.0
--replabel=1
--surround=|
--onorm=target
--sqnorm=true
--mfsc=true
--filterbanks=40
--nthread=14
--batchsize=32
--runname=librispeech_clean_trainlogs
--iter=1
--logtostderr=1
--minloglevel=0
--enable_distributed=True
--reportiters=1
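(For reference, a minimal sketch of how a flags file like this is typically launched, following the wav2letter tutorials; the build path and the file name train.cfg below are placeholders, not taken from this issue:)

```
# Single-GPU run (hypothetical paths):
/path/to/wav2letter/build/Train train --flagsfile train.cfg

# Multi-GPU run via OpenMPI, as in the wav2letter distributed docs:
mpirun -n 4 /path/to/wav2letter/build/Train train \
  --enable_distributed=true --flagsfile train.cfg
```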

tryanbot avatar Aug 10 '20 13:08 tryanbot

What appears in your log while it hangs? Your batch size seems very large; could you try 4 just as a test?
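(One way to test this, as a minimal sketch: either change --batchsize=32 to --batchsize=4 in the .cfg above, or, since gflags generally lets command-line flags override a flags file, append it at launch; the path below is a placeholder:)

```
/path/to/wav2letter/build/Train train --flagsfile train.cfg --batchsize=4
```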

tlikhomanenko avatar Aug 10 '20 17:08 tlikhomanenko

I am able to start training with only 1 GPU, so I don't think the batch size is the problem, but I tried batchsize 4 anyway and it still hangs. The log from the multi-GPU run is empty, as if nothing ran:

[screenshot: empty multi-GPU log]

while the log with only 1 GPU looks like this:

[screenshot: single-GPU log]

It seems that multi-GPU training never actually starts; it only reserves the computing resources. Please help.

tryanbot avatar Aug 10 '20 18:08 tryanbot

Additional information, in case it helps:

[screenshot: stack trace on interrupt]

This is the message when I interrupt the process (after it has been hanging for a while).

tryanbot avatar Aug 10 '20 18:08 tryanbot

My suggestion, @tryanbot, would be to try increasing --iter (I saw that you tried reducing batchsize to 4).

Dr-AyanDebnath avatar Aug 11 '20 03:08 Dr-AyanDebnath

I tried 1 million before. The iteration count is probably not the problem, because the script runs fine on 1 GPU (regardless of batch size and iterations).

tryanbot avatar Aug 11 '20 03:08 tryanbot

Are you using a similar command for the multi-GPU run?

mpirun --allow-run-as-root -n 4 /root/wav2letter/build/Train train -enable_distributed true --flagsfile /home/train.cfg --minloglevel=0 --logtostderr=1

Dr-AyanDebnath avatar Aug 11 '20 03:08 Dr-AyanDebnath

@Dr-AyanDebnath you saw my initial message, right? I explained my command there. Anyway, I tried your command:

[screenshot: command launched]

Still stuck:

[screenshot: GPUs allocated but idle]

It only reserves the GPUs without running anything. I think the problem may be in the dependencies. Please help.

tryanbot avatar Aug 11 '20 03:08 tryanbot

Could you run just the tests for flashlight (go to the build directory and run make test)? There are tests for the distributed pieces there; this is just to be sure they work for you.
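(A minimal sketch of running the suite, assuming a standard CMake build tree:)

```
cd flashlight/build
make test                    # or: ctest --output-on-failure for per-test logs
```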

cc @jacobkahn

tlikhomanenko avatar Aug 11 '20 04:08 tlikhomanenko

Could you please send me a link to material to read so I can do that by myself?

Edit: sorry for not reading your comment carefully, @tlikhomanenko. Here is the result:

[screenshot: make test output with a failure]

tryanbot avatar Aug 11 '20 04:08 tryanbot

Update: I fixed the failing test case; now the result is:

[screenshot: all wav2letter tests passing]

However, the multi-GPU training process is still stuck. I can't see any distributed GPU test among the test cases. Please help, @tlikhomanenko.

tryanbot avatar Aug 12 '20 03:08 tryanbot

Could you check whether all tests pass in the flashlight directory? The distributed test is in flashlight, not in wav2letter.

tlikhomanenko avatar Aug 12 '20 05:08 tlikhomanenko

Okay, all tests passed in flashlight:

[screenshot: flashlight tests passing]

tryanbot avatar Aug 12 '20 06:08 tryanbot

The cause is probably in MPI itself. @jacobkahn any idea on this?

tlikhomanenko avatar Aug 12 '20 17:08 tlikhomanenko

@tryanbot — can you run the AllReduceTest with mpirun in the same way you'd start training? Something like

mpirun --allow-run-as-root -n 2 ./AllReduceTest train --enable_distributed true --logtostderr=1

jacobkahn avatar Aug 12 '20 17:08 jacobkahn

Also stuck:

[screenshots: AllReduceTest hanging]

Please help with further assistance, @jacobkahn.

tryanbot avatar Aug 12 '20 17:08 tryanbot

@tryanbot — this seems like an issue with your setup or other dependencies. Can you build and run the tests here and see what happens? https://github.com/NVIDIA/nccl-tests
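(A minimal sketch following the nccl-tests README; -g 2 assumes two GPUs, and NCCL_DEBUG=INFO is a standard NCCL environment variable that prints extra initialization/transport logs, which often show where a hang occurs:)

```
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make                        # pass CUDA_HOME=/usr/local/cuda if CUDA is elsewhere
# Sweep all-reduce message sizes from 8 B to 128 MB on 2 GPUs:
NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
```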

jacobkahn avatar Aug 13 '20 15:08 jacobkahn

Stuck here:

[screenshot: nccl-tests hanging]

Any idea why this happens, @jacobkahn?

tryanbot avatar Aug 13 '20 16:08 tryanbot

Any update on this, @jacobkahn @tlikhomanenko? Is there any insight into why this happens and how to solve it?

tryanbot avatar Aug 18 '20 04:08 tryanbot

OK, it seems this is not related to flashlight or wav2letter itself. Could you search for this issue on NVIDIA's side and report it to them to find out how to debug/fix it?

tlikhomanenko avatar Aug 27 '20 16:08 tlikhomanenko