
DataParallel takes too long

Open · Ersho opened this issue 5 years ago · 1 comment

Hello,

I am trying to run the training part on multiple GPUs (4 Tesla V100), using the command

python train.py --model_name flowavenet --batch_size 8 --n_block 8 --n_flow 6 --n_layer 2 --block_per_split 4 --num_gpu 4

It runs everything without an error and outputs

num_gpu > 1 detected. converting the model to DataParallel...

It has been frozen with this output for more than an hour. I checked GPU usage and all four GPUs were in use, but nothing changed. I have a few questions: is there a problem with the code, or do I just have to wait longer for training to start? Would decreasing batch_size speed up the conversion to DataParallel?
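For reference, here is a minimal sketch of what the conversion presumably amounts to, assuming the standard torch.nn.DataParallel wrapper (the actual code in train.py may differ). The wrap itself is essentially free, so a long stall is more likely to come from data loading or the first forward pass than from the conversion step:

```python
import torch
from torch import nn

# Hypothetical stand-in model; the real FloWaveNet model is built in train.py.
model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 1))

num_gpu = torch.cuda.device_count()
if num_gpu > 1:
    # Wrapping is essentially instantaneous; replicas are created per forward call.
    model = nn.DataParallel(model)
model = model.cuda()

# The first forward pass is where per-GPU replication and CUDA initialization happen.
x = torch.randn(8, 128).cuda()
out = model(x)
print(out.shape)
```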

Note: I am running training on the LJ-Speech-Dataset.

Also, could you share download links for the pretrained models? That would be very helpful.

Ersho · Mar 01 '19, 22:03

Sorry for the late reply. A hang of more than an hour is indeed strange and shouldn't happen: the default stdout logging interval is 100 steps (display_step), so you should see output well before that. Could you test again with display_step = 1 inside train()? Alternatively, could you check whether the DistributedDataParallel version from @1ytic alleviates the problem?
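As a rough illustration of both suggestions (not the actual train.py code, variable names are placeholders), here is a minimal single-process sketch that logs every step and wraps a model in DistributedDataParallel; in practice each GPU would get its own process launched with torchrun or torch.distributed.launch:

```python
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process setup for illustration only; a real run uses one process per GPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Linear(128, 1)  # hypothetical stand-in for the FloWaveNet model
ddp_model = DDP(model)

# With display_step = 1, a loss line should appear after the very first batch,
# which quickly shows whether training is progressing or genuinely stuck.
display_step = 1
for step in range(3):
    out = ddp_model(torch.randn(8, 128))
    loss = out.pow(2).mean()
    loss.backward()
    if step % display_step == 0:
        print(f"step {step}: loss {loss.item():.4f}")

dist.destroy_process_group()
```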

L0SG · Apr 23 '19, 14:04