wav2letter
Slow training
Hello,
I was able to perform training on the LibriSpeech train-clean-100 dataset and I got the expected results.
Currently I am trying to train a transformer acoustic model on the 1K-hour LibriSpeech data. I am using an Amazon machine with a single GPU, and it looks like training will take more than a week. What is the expected training time on the 1K-hour LibriSpeech data with a single GPU?
If I increase the number of GPUs to 8, will it decrease the training time by a factor of 8?
In the next step I am going to train wav2letter++ on my in-house data. I have 50K hours of training data, so it seems that wav2letter++ training could be too slow. Is it possible to use several computers in parallel to perform the training?
Thanks, Alexander.
@AlexandderGorodetski
We didn't try running training on 1k hours with 1 GPU. With 32 GPUs, we fully trained a transformer model on the 1k hours in about 3 days, and on LibriVox (54k hours) in 1-2 weeks, cc @syhw.
> If I increase the number of GPUs to 8, will it decrease the training time by a factor of 8?
>
> In the next step I am going to train wav2letter++ on my in-house data. I have 50K hours of training data, so it seems that wav2letter++ training could be too slow. Is it possible to use several computers in parallel to perform the training?
In flashlight we have almost linear scaling for distributed training, so yes, with 8 GPUs one epoch will be roughly 8 times faster.
Info about distributed training is here: https://fl.readthedocs.io/en/latest/distributed.html. We already have this in Train.cpp (synchronization is done via the rendezvous path, see https://github.com/facebookresearch/wav2letter/blob/master/src/common/Defines.cpp#L325), so you only need to start the processes; how you do that depends on your system and node configuration.
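To make the launch step a bit more concrete, below is a minimal sketch of the per-process setup that Train.cpp performs when distributed training is enabled, using flashlight's file-system rendezvous. The argument values (rank, world size, rendezvous path, devices per node) are placeholders you would pass to each process, and the exact flag and constant names should be checked against your wav2letter/flashlight version; treat this as an illustration, not a drop-in replacement for Train.cpp.

```cpp
#include <arrayfire.h>
#include <flashlight/flashlight.h>

#include <string>

// Rough sketch of the distributed setup done once per process.
// Every process is started with its own rank, the total world size, and a
// rendezvous path on storage shared by all nodes (e.g. NFS).
void initDistributedSketch(
    int worldRank,
    int worldSize,
    int maxDevicesPerNode,
    const std::string& rndvPath) {
  // Pin this process to one GPU on its node.
  af::setDevice(worldRank % maxDevicesPerNode);

  // File-system rendezvous: processes discover each other through files
  // written under rndvPath (the --rndv_filepath flag in wav2letter).
  fl::distributedInit(
      fl::DistributedInit::FILE_SYSTEM,
      worldRank,
      worldSize,
      {{fl::DistributedConstants::kMaxDevicePerNode,
        std::to_string(maxDevicesPerNode)},
       {fl::DistributedConstants::kFilePath, rndvPath}});

  // From here on fl::getWorldRank() / fl::getWorldSize() describe the whole
  // job, and gradients are averaged across processes with an all-reduce
  // before each optimizer step, which is what gives the near-linear scaling.
}
```

Launching on 8 GPUs then amounts to starting 8 such processes (ranks 0-7) with the same world size and rendezvous path, one per GPU, across however many machines you have; on a cluster this is typically done with mpirun, srun, or whatever scheduler you use, as described in the flashlight docs linked above.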
@tlikhomanenko is correct. I should add that 8 times faster per epoch doesn't always mean 8 times faster convergence, but up to 16-32 GPUs it is pretty much the case.