Training on a distributed machine is slow. Using 8 Nvidia V100s.

dimeldo opened this issue 4 years ago · 8 comments

I'm using an AWS p3dn.24xlarge to train on my data with 8 Nvidia V100 GPUs, but training seems slower than on 1 GPU.

This is the config in train-horovod.py:

def train_main(dataset,
               model_name='345M',
               seed=None,
               batch_size=1,
               sample_length=1023,
               sample_num=1,
               sample_every=500,
               run_name='run1',
               restore_from='latest',
               save_every=1000,
               combine=50000):

This is the output; as you can see, each step takes a long time. Trying to increase the batch size results in OOM.

[1 | 13.96] loss=3.12 avg=3.12
[2 | 16.30] loss=22.49 avg=12.85
[3 | 18.51] loss=8.58 avg=11.41
[4 | 20.70] loss=7.58 avg=10.44
[5 | 23.08] loss=7.59 avg=9.86
[6 | 25.48] loss=6.96 avg=9.36
[7 | 27.52] loss=6.34 avg=8.92
[8 | 29.85] loss=6.26 avg=8.58
[9 | 32.30] loss=5.86 avg=8.26
[10 | 34.31] loss=6.00 avg=8.02
[11 | 36.61] loss=5.78 avg=7.81
[12 | 38.94] loss=5.53 avg=7.61
[13 | 41.25] loss=5.32 avg=7.42
[14 | 43.69] loss=5.06 avg=7.24
[15 | 45.94] loss=6.06 avg=7.16
[16 | 48.34] loss=4.94 avg=7.01
[17 | 50.74] loss=5.16 avg=6.89
[18 | 53.10] loss=4.73 avg=6.76
[19 | 55.21] loss=4.54 avg=6.63
[20 | 57.56] loss=5.09 avg=6.55
[21 | 59.75] loss=4.66 avg=6.45
[22 | 62.22] loss=4.44 avg=6.35
[23 | 64.45] loss=4.40 avg=6.25
[24 | 66.68] loss=3.91 avg=6.14
[25 | 69.04] loss=3.79 avg=6.04
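
For context: with Horovod data parallelism, each of the 8 MPI ranks runs its own copy of the model on its own batch_size=1 micro-batch, so the wall-clock time per step is roughly the single-GPU step time plus allreduce overhead. The gain shows up as a larger effective batch (8 sequences per optimizer step), not as faster individual steps. Below is a minimal sketch of the usual TF1 + Horovod pattern; the toy loss is purely illustrative and not what train-horovod.py actually builds.

# Minimal sketch of the standard Horovod (TF1) data-parallel recipe.
# Assumptions: toy model instead of GPT-2; hyperparameters are placeholders.
import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin this process to one GPU so the ranks don't all allocate on GPU 0.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy model standing in for GPT-2: each rank computes its own micro-batch.
x = tf.placeholder(tf.float32, [None, 16], name="x")
w = tf.get_variable("w", [16, 1])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

# DistributedOptimizer averages gradients across ranks with an allreduce,
# so 8 ranks at batch_size=1 behave like one effective batch of 8.
opt = tf.train.AdamOptimizer(1e-4 * hvd.size())  # LR is often scaled by world size
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# Rank 0 broadcasts its initial weights so every worker starts from the same state.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    for step in range(5):
        sess.run(train_op, feed_dict={x: np.random.randn(1, 16)})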

dimeldo · Sep 02 '19 16:09

I think distributed fine-tuning is not possible currently. As you can see, it is labeled as "out-of-date". I think you are better off trying to fine-tune with the PyTorch implementation from PyTorch-Transformers.

HansBambel · Sep 04 '19 14:09

What command did you use to make it work? I am using an AWS ml.p3.8xlarge with four 16 GB V100 GPUs to train, but I am getting an OOM error. This is the command I am using:

mpirun -np 4 -H localhost:4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x PYTHONPATH=src -mca pml ob1 -mca btl ^openib python /home/ec2-user/SageMaker/gpt-2/train-horovod.py --dataset /home/ec2-user/SageMaker/gpt-2/src/Dataset/MB23.npz --model_name /home/ec2-user/SageMaker/gpt-2/src/models/345M --batch_size 1

I am assuming my GPUs are not enough to train the 345M model even with batch_size 1. Would using more GPUs help me, or is multi-GPU training just not possible?
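
An aside on why more GPUs alone may not help: under plain data parallelism every rank holds a full copy of the weights, gradients, and Adam state, and only the data is split, so per-GPU memory does not go down as you add GPUs. A rough, hedged back-of-envelope estimate of why a 16 GB V100 is already tight for the 345M model at batch_size=1 (the activation figure depends heavily on sequence length):

# Rough memory estimate for 345M parameters trained with Adam in fp32.
# These numbers are approximations, not measurements from this repo.
params = 345e6
bytes_per_float = 4

weights   = params * bytes_per_float       # ~1.4 GB
gradients = params * bytes_per_float       # ~1.4 GB
adam_m_v  = 2 * params * bytes_per_float   # ~2.8 GB (first and second moments)

static_gb = (weights + gradients + adam_m_v) / 1e9
print(f"static training state: ~{static_gb:.1f} GB before activations")
# That leaves roughly 10 GB on a 16 GB V100 for activations at sample_length ~1023,
# which is why batch_size=1 fits but batch_size=2 can already tip into OOM.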

shamiul94 · Jun 19 '20 18:06

Are you talking about the PyTorch version? I was able to train the 345M version on a single V100.

HansBambel · Jun 20 '20 07:06

Are you talking about the PyTorch version?

No, I am using the code in this repository, which uses TensorFlow instead of PyTorch. Does the PyTorch version work better than the TensorFlow version? Also, which PyTorch version are you talking about? (Any link would be helpful.)

I was able to train the 345M version on a single V100.

Yes, I agree. I also tried to run the 345M model using train.py from this very repository, which also uses TensorFlow. It successfully ran this model on a single V100, but only with --batch_size 1; for a batch_size higher than 1, it failed. I am trying to find a way to increase the batch_size value using multiple GPUs. I was surprised to see that although this model could be trained on a single V100, I got a ResourceExhausted error while trying it on multiple GPUs (4xV100). Shouldn't it be the opposite?

I have explained my issues in #53 and #52. It would be helpful if you could go through those two issues as well. Thank you.
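
One common way to get an effective batch larger than 1 without more GPU memory is gradient accumulation: run several batch_size=1 forward/backward passes, sum the gradients, and apply them once. Below is a generic TF1-style sketch of the idea with a toy model; it is not this repository's own implementation, though some GPT-2 training scripts expose a similar option.

# Gradient accumulation sketch (TF1): simulate a larger batch by summing
# gradients over accum_steps micro-batches before a single optimizer update.
import tensorflow as tf

accum_steps = 8  # effective batch of 8 at the memory cost of batch_size=1

x = tf.placeholder(tf.float32, [None, 16])
w = tf.get_variable("w", [16, 1])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))  # toy stand-in for the LM loss

opt = tf.train.AdamOptimizer(1e-4)
tvars = tf.trainable_variables()
grads = tf.gradients(loss, tvars)

# Non-trainable buffers that hold the running gradient sums.
accum = [tf.Variable(tf.zeros_like(v), trainable=False) for v in tvars]
zero_op  = [a.assign(tf.zeros_like(a)) for a in accum]
accum_op = [a.assign_add(g) for a, g in zip(accum, grads)]
apply_op = opt.apply_gradients(
    [(a / accum_steps, v) for a, v in zip(accum, tvars)])

# Per optimizer step (pseudo-usage; feed real micro-batches for x):
#   sess.run(zero_op)
#   for _ in range(accum_steps):
#       sess.run(accum_op, feed_dict={x: ...})
#   sess.run(apply_op)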

shamiul94 · Jun 20 '20 10:06

I can recommend checking out Hugging Face Transformers. When I was working with GPT-2 it was PyTorch-only, but they have since extended the repository to TensorFlow as well. There should also be examples of people doing exactly what you are trying to do.
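
For reference, here is a minimal, illustrative sketch of a single fine-tuning step with Hugging Face Transformers in PyTorch; the model name, learning rate, and example text are placeholders, and the library's own training scripts are far more complete.

# One illustrative fine-tuning step for gpt2-medium (the 345M model) in PyTorch.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium").to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

text = "Example training text goes here."  # stand-in for a real dataset
inputs = tokenizer(text, return_tensors="pt").to(device)

model.train()
# Passing labels makes the model compute the language-modeling loss internally.
outputs = model(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()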

Best of luck!

HansBambel · Jun 20 '20 10:06

Hi @shamiul94! According to my notes, I used this command:

mpirun --allow-run-as-root -np 8 -H localhost:8 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x PYTHONPATH=src -mca pml ob1 -mca btl ^openib train-horovod.py --dataset data/train.npz --val_dataset data/valid.npz

Also, as others said, a V100 can fit the gpt2-medium model with batch size 1. I also recommend moving to Hugging Face's code, as suggested above!

dimeldo · Jun 20 '20 16:06

Hi, @dimeldo! Huge thanks for your input!

  • Yes. But I would like to tweak the batch_size value too; that's why I am considering multi-GPU training. Is it possible to set batch_size higher than 1 if I use 8xV100? Or do I need more? Or is it not possible at all using Nshepperd's codebase?
  • I will definitely look into Hugging Face's code. Thanks for the suggestion!
  • I am quite new to this multi-GPU training arena. I was getting an OOM error while using 4xV100. Would it work if I use 8xV100? I am working on 345M. Which model were you working on?

shamiul94 · Jun 21 '20 10:06

I can't quite remember; I think it was the 345M one. I also can't recall whether multi-GPU worked out alright in the end. Good luck in your research!

dimeldo · Jun 22 '20 11:06