tensor2tensor
Multi-GPU gives no speedup for transformer model
Description
I am training a Transformer model on the Librispeech dataset using 4 GPUs and 8 CPU cores.
I have tested the following:
Single-GPU
export CUDA_VISIBLE_DEVICES=0
t2t-trainer \
  --worker_gpu=1 \
  # ..
Multi-GPU
export CUDA_VISIBLE_DEVICES=0,1,2,3
t2t-trainer \
  --worker_gpu=4 \
  # ..
Both scripts work. The training starts and on the surface everything looks okay. However, I am getting a global_step/sec of only ~2 for Multi-GPU, compared to ~9 for Single-GPU.
Shouldn't I see a speedup using multiple GPUs? If so, what might be the problem here? Can I trust the log output?
Environment information
OS: Linux #37~16.04.1-Ubuntu SMP Tue Aug 28 10:44:06 UTC 2018 GNU/Linux
$ pip freeze | grep tensor
tensor2tensor==1.9.0
tensorboard==1.10.0
tensorflow-gpu==1.10.1
$ python -V
Python 3.5.6 :: Anaconda, Inc.
This is expected: T2T uses synchronous training, so one step with 4 GPUs trains on a 4-times-larger effective batch size. See e.g. this paper.
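For illustration, here is a minimal sketch of what this means in terms of the flags (batch_size=16 is only an example value; in T2T the batch_size hparam applies per GPU, so the effective batch size is batch_size times worker_gpu):
export CUDA_VISIBLE_DEVICES=0,1,2,3
# Synchronous data parallelism: each GPU processes its own batch every step,
# so one step with 4 GPUs covers 4 * 16 = 64, i.e. 4x as much data per step.
t2t-trainer \
  --worker_gpu=4 \
  --hparams='batch_size=16' \
  # ..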
@martinpopel Thank you for the link. What I understand from it is that I have to compare the effective throughput, i.e. samples per second, is that right?
If I take my example from above, this would mean:
Single: 9 gstep/s * 16 bsize = 144 samples/s
Multi: 2 gstep/s * 16 bsize * 4 GPUs = 128 samples/s
It would seem that the additional overhead for this dataset effectively slows down the training. Am I getting something wrong?
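Just to spell out the arithmetic I used (a quick shell check, assuming a per-GPU batch size of 16 and the step rates from my logs above):
# effective throughput = global_step/sec * per-GPU batch size * number of GPUs
echo "1 GPU:  $((9 * 16)) samples/s"      # = 144
echo "4 GPUs: $((2 * 16 * 4)) samples/s"  # = 128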
Yes, it seems the training throughput with 4 GPUs is lower than with a single GPU in your case. This is strange; usually it is higher (a sublinear speedup). Maybe you have a slow interconnect between the GPUs (e.g. no NVLink). I am also not familiar with Librispeech. However, a larger batch size may lead to faster convergence on the dev set, as explained in the paper I linked.
@martinpopel In the meantime I ran another small experiment: starting from scratch, I reached a loss of 1.7 after 2 minutes with one GPU, while with 4 GPUs it took 9 minutes to get there. That's not a very scientific test, but it seems that multi-GPU does not help me here, for whatever reason. I'll look into the interconnect question just to be sure.
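In case it helps anyone else, this is how I plan to check the interconnect (assuming nvidia-smi is available on the machine):
# Print the GPU-to-GPU topology matrix: NV# entries mean NVLink,
# while PIX/PXB/PHB/SYS mean the traffic goes over PCIe and/or through the CPU.
nvidia-smi topo -m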
Any update on using multiple GPUs?
@lkluo What I can say, based on my observations, is that 4 GPUs let the model converge faster and to a better final result, which I attribute to the larger effective batch size. From my experience I'd recommend using multiple GPUs and larger batch sizes.
How long did it take you to reach SOTA on 4 GPUs?
Same problem here.