Dual-GPU training is not gaining speed? (egs2/ljspeech/train_vits)
I installed a 2nd GPU (now 2x RTX 3090 24GB) and am confused about whether I am actually gaining training speed.
With a single GPU, my average "train_time" was ~1.02s, as reported at each 50-batch interval. With 2x GPUs (enabled with "--ngpu 2"), the average "train_time" was ~1.0s, essentially identical to the single-GPU case! The 2nd GPU was definitely in use according to train.log as well as the "nvidia-smi" command (both GPUs were busy, up to the 70% level).
In both cases, batch_bins = 5,000,000 and batch_type = numel.
Am I missing something here?
Thanks in advance for your answer!
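For reference, the runs are launched roughly like this (a sketch; the config path and the use of --train_args overrides are illustrative and may differ from my actual recipe setup):

```sh
cd egs2/ljspeech/tts1

# Single-GPU run (illustrative; batch settings may also live in the config YAML)
./run.sh --ngpu 1 \
  --train_config conf/tuning/train_vits.yaml \
  --train_args "--batch_type numel --batch_bins 5000000"

# Dual-GPU run, same batch settings
./run.sh --ngpu 2 \
  --train_config conf/tuning/train_vits.yaml \
  --train_args "--batch_type numel --batch_bins 5000000"
```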
Please increase batch_bins by multiplying it by the number of GPUs (i.e., batch_bins = 10,000,000 in your case).
(This is the common practice for multi-GPU training in various frameworks.)
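For example (a sketch, assuming batch_bins is the total mini-batch size split across the GPUs and that run.sh forwards --train_args to the trainer):

```sh
# Scale batch_bins with the number of GPUs so the per-GPU workload
# (and thus the number of steps per epoch) stays comparable.
ngpu=2
batch_bins=$((5000000 * ngpu))   # 10,000,000 for 2 GPUs
./run.sh --ngpu "${ngpu}" --train_args "--batch_bins ${batch_bins}"
```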
I guess I am confused by the "Estimated time to finish" reported at the end of each epoch. For both my 1x GPU and 2x GPU cases, the estimated completion time was about the same. If I double batch_bins to 10 million for dual GPUs, it would take even longer to complete each epoch, and thus longer to finish the entire task (1000 epochs).
So I must have a misunderstanding here. If I double batch_bins for the 2x GPU setup, do I effectively achieve the same results with half the number of epochs compared to the single-GPU setup?
> If I double batch_bins to 10 million for dual GPUs, it would take even longer to complete each epoch
This would be due to some inefficiency in the GPU computation.
- Please optimize the batch size and GPU utilization (otherwise, this happens quite often).
- File access may also cause this; it can be mitigated by increasing the number of workers, but if your disk is slow, that may not fully fix the issue.
- If accum_grad is more than 2, you may divide it by 2 (see the sketch below).
You may also check https://espnet.github.io/espnet/espnet2_training_option.html#the-relation-between-mini-batch-size-and-number-of-gpus.
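A sketch of these adjustments as command-line overrides (assuming run.sh forwards --train_args to the trainer; the same keys can also be set directly in the train config YAML):

```sh
# Illustrative overrides combining the suggestions above for a 2-GPU run:
# scaled batch_bins, accum_grad kept at 1, and more dataloader workers.
./run.sh --ngpu 2 \
  --train_args "--batch_bins 10000000 --accum_grad 1 --num_workers 8"
```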
I did a quick test with different batch_bins values, all with accum_grad=1. I am still very confused by the results:
"batch_bins" "train_time " "Estimated time to finish"
1x GPU
Test 1: 5,000,000 ~ 1.0s ~ 1 wk 4 days
Test 2: 2,500,000 ~ 0.78s. ~ 1 wk 2 days
Test 3: 1,250,000. ~ 0.73s ~ 1 wk 1 days
2X GPU Test 4: 10,000,000. ~1.2s. ~ 1 wk 6 days Test 5: 5,000,000. ~0.99s. ~ 1 wk 4 days Test 6: 2,500,000 ~0.95s. ~ 1 wk 4 days
I also tried the following variations, which made little difference in speed: num_workers of 4 or 8; replacing the regular SSD with an NVMe SSD (Samsung 980 Pro). My hardware is not the most modern, but still capable: Supermicro X9DAI, dual Xeon E5-2680 v2, 128 GB RAM, 2x RTX 3090.
Based on these numbers, my dual-GPU setup is slower than a single GPU?!
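In case it helps, here is a generic way to rule out a disk or dataloader bottleneck while training runs (not ESPnet-specific; assumes the sysstat tools are installed):

```sh
# Watch disk and CPU while training runs in another terminal.
# High %iowait or a saturated device would point at the dataloader/disk
# rather than the GPUs as the bottleneck.
iostat -x 2   # per-device utilization every 2 seconds (sysstat)
mpstat 2      # CPU usage including %iowait every 2 seconds
```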
Thanks for the detailed report.
So, the GPUs were also always busy, right?
(GPU utilization in nvidia-smi was always over 70%.)
I think @kan-bayashi or @soumimaiti can answer it.
Here is a sample of GPU activity during 1x GPU and 2x GPU training (over 30- and 60-second spans, based on output readings from nvidia-smi):
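For reference, a sample like this can be collected with nvidia-smi's query mode (a generic sketch, not necessarily the exact command used for the readings above):

```sh
# Log per-GPU utilization and memory once per second for 60 seconds.
nvidia-smi \
  --query-gpu=timestamp,index,utilization.gpu,memory.used \
  --format=csv -l 1 > gpu_activity.csv &
smi_pid=$!
sleep 60
kill "${smi_pid}"
```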
Hi, is there any update on this GPU training problem? I have run into a similar issue with low GPU utilization (just like the pictures above). Is this a normal GPU utilization pattern for TTS training in espnet2? Is there any way to optimize the training speed by making GPU utilization more effective? (I have tried increasing model complexity and batch size, but it seems to help little with GPU utilization.)