FloWaveNet
DataParallel takes too long
Hello,
I am trying to run the training part on multiple GPUs (4 Tesla V100), using the command
python train.py --model_name flowavenet --batch_size 8 --n_block 8 --n_flow 6 --n_layer 2 --block_per_split 4 --num_gpu 4
It runs everything without an error and outputs
num_gpu > 1 detected. converting the model to DataParallel...
It has been frozen at this output for more than an hour. I checked the GPUs and all of them show utilization, but I don't see any progress. I have several questions: is there a problem with my setup, or do I just have to wait longer for training to start? Would decreasing batch_size speed up the conversion to DataParallel?
Note: I am training on the LJ Speech dataset.
Also, could you share download links for the pretrained models? That would be very helpful.
Sorry for the late reply. The >1 hour hang is indeed strange and shouldn't happen (the default stdout logging interval, display_step, is 100). Could you test again with display_step = 1 inside train()? Or, could you verify whether the DistributedDataParallel version from @1ytic alleviates the problem?