
Dual-GPU training is not gaining speed? (egs2/ljspeech/train_vits)

Open BillDH2k opened this issue 1 year ago • 7 comments

I installed a 2nd GPU (now 2x RTX 3090 24GB) and am confused about whether I am actually gaining training speed.

With a single GPU, my average "train_time" was ~1.02s, reported at each 50-batch interval. With 2x GPUs (enabled with "--ngpu 2"), the average "train_time" was ~1.0s, essentially identical to the single-GPU case! The 2nd GPU was definitely in use, according to train.log as well as the "nvidia-smi" command (both GPUs were busy, up to the 70% level).

In both cases, batch_bins = 5,000,000 and batch_type = numel.

Am I missing something here?

Thanks in advance for your answer!

BillDH2k avatar Jul 22 '23 04:07 BillDH2k

Please multiply batch_bins by the number of GPUs (i.e., batch_bins = 10,000,000 in your case). (This is the common usage for multiple GPUs in various frameworks.)
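To illustrate the relation (a minimal sketch, illustrative arithmetic only, not ESPnet code): with data-parallel training, the mini-batch built from batch_bins is split across the GPUs, so keeping batch_bins fixed only shrinks the per-GPU batch.

```python
# Minimal sketch of the batch_bins vs. ngpu relation (illustrative only).

def per_gpu_bins(batch_bins: int, ngpu: int) -> float:
    """The mini-batch defined by batch_bins is split across the GPUs,
    so each GPU processes roughly batch_bins / ngpu per step."""
    return batch_bins / ngpu

print(per_gpu_bins(5_000_000, 1))   # 5,000,000 bins on the single GPU
print(per_gpu_bins(5_000_000, 2))   # 2,500,000 bins per GPU: each GPU is under-loaded
print(per_gpu_bins(10_000_000, 2))  # 5,000,000 bins per GPU: same per-GPU load as before
```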

sw005320 avatar Jul 22 '23 11:07 sw005320

I guess I am confused by the "Estimated time to finish" reported at the end of each epoch. For both my 1x GPU and 2x GPU cases, the estimated completion time was about the same. If I double batch_bins to 10 million for dual GPUs, it would take even longer to complete each epoch, and thus longer to finish the entire task (1000 epochs).

So I must have a misunderstanding here. If I double batch_bins for the 2x GPU setup, do I effectively achieve the same result in half as many epochs as the single-GPU setup?

BillDH2k avatar Jul 22 '23 15:07 BillDH2k

If I double batch_bins to 10 million for dual GPUs, it would take even longer to complete each epoch

This is likely due to inefficient GPU computation.

  • Please tune the batch size while watching GPU utilization (otherwise, inefficient GPU computing quite often happens).
  • File access may also cause this; it can be mitigated by increasing the number of workers, but if your disk is slow, that may not fully fix the issue.
  • If accum_grad is 2 or more, you may divide it by 2 (see the sketch below the documentation link).

You may also check https://espnet.github.io/espnet/espnet2_training_option.html#the-relation-between-mini-batch-size-and-number-of-gpus.
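As a rough illustration of the accum_grad point above (a hypothetical example, not ESPnet code; accum_grad is the number of gradient-accumulation steps): doubling batch_bins while halving accum_grad keeps the number of bins per optimizer update the same, but packs more work into each forward/backward pass.

```python
# Minimal sketch of the batch_bins / accum_grad trade-off (hypothetical numbers).

def bins_per_update(batch_bins: int, accum_grad: int) -> int:
    """Gradient accumulation sums accum_grad mini-batches before each
    optimizer step, so one update effectively sees batch_bins * accum_grad."""
    return batch_bins * accum_grad

# Hypothetical baseline: batch_bins=5M with accum_grad=2 -> 10M bins per update.
print(bins_per_update(5_000_000, 2))   # 10,000,000
# Doubling batch_bins and halving accum_grad keeps the update size unchanged,
# while each forward/backward pass is larger and keeps the GPUs busier.
print(bins_per_update(10_000_000, 1))  # 10,000,000
```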

sw005320 avatar Jul 22 '23 16:07 sw005320

I did a quick test with different batch_bins values. All runs have accum_grad=1. I am still very confused by the results:

  "batch_bins"    "train_time "  "Estimated time to finish"

1x GPU Test 1: 5,000,000 ~ 1.0s ~ 1 wk 4 days
Test 2: 2,500,000 ~ 0.78s. ~ 1 wk 2 days Test 3: 1,250,000. ~ 0.73s ~ 1 wk 1 days

2X GPU Test 4: 10,000,000. ~1.2s. ~ 1 wk 6 days Test 5: 5,000,000. ~0.99s. ~ 1 wk 4 days Test 6: 2,500,000 ~0.95s. ~ 1 wk 4 days
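For reference, one way to compare these runs is bins processed per second (batch_bins / train_time); a small sketch with the numbers above, assuming train_time is the seconds per reported step:

```python
# Minimal sketch: bins per second for each test above, assuming
# train_time is the time per reported training step.
tests = {
    "Test 1 (1x GPU, 5.0M bins)":  (5_000_000, 1.00),
    "Test 2 (1x GPU, 2.5M bins)":  (2_500_000, 0.78),
    "Test 3 (1x GPU, 1.25M bins)": (1_250_000, 0.73),
    "Test 4 (2x GPU, 10M bins)":   (10_000_000, 1.20),
    "Test 5 (2x GPU, 5.0M bins)":  (5_000_000, 0.99),
    "Test 6 (2x GPU, 2.5M bins)":  (2_500_000, 0.95),
}
for name, (bins, sec) in tests.items():
    print(f"{name}: {bins / sec / 1e6:.1f}M bins/s")
```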

I also tried the following variations; they made little difference in speed: num_workers = 4 or 8; replacing the regular SSD with an NVMe SSD (Samsung 980 Pro). My hardware is not so modern, but still capable: Supermicro X9DAI, dual Xeon E5-2680 v2, 128 GB RAM, 2x RTX 3090.

Based on these numbers, is my dual-GPU setup slower than a single GPU?!

BillDH2k avatar Jul 23 '23 04:07 BillDH2k

Thanks for the detailed report. So, the GPUs were also always busy, right? (GPU utilization in nvidia-smi was always over 70%?)

I think @kan-bayashi or @soumimaiti can answer it.

sw005320 avatar Jul 23 '23 04:07 sw005320

Here is a sample of GPU activity during 1x GPU and 2x GPU training (over 30–60 second spans, based on nvidia-smi readings): [attached charts: 1X_GPU, 2X_GPU]
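(For reference, a minimal sketch of how such per-GPU readings can be collected by polling nvidia-smi once per second; this is illustrative, not the exact command used for the charts above.)

```python
# Minimal sketch: poll nvidia-smi once per second and print the
# utilization of every GPU (one integer percentage per GPU).
import subprocess
import time

def sample_gpu_util():
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return [int(line) for line in out.strip().splitlines()]

for _ in range(30):              # ~30-second window, similar to the charts
    print(sample_gpu_util())     # e.g. [68, 71] for a 2x GPU run
    time.sleep(1)
```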

BillDH2k avatar Jul 23 '23 21:07 BillDH2k

Hi, is there any update on the GPU training problem? I have run into a similar problem with low GPU utilization (much like the pictures above). Is this a normal GPU utilization level for GPU training in espnet2 TTS? Is there any way to optimize the training speed by using the GPUs more effectively? (I have tried increasing the model complexity and the batch size, but it seems to help little with GPU utilization.)

emesoohc2022 avatar Dec 23 '23 13:12 emesoohc2022