Recommended GPU size when training BERT-base

BigBadBurrow opened this issue 6 years ago • 7 comments

What is the minimum GPU spec for training the base model?

Obviously I realise it depends on the hyperparameters, but I have a 4GB GPU that I'm trying to train BERT-base on with the run_classifier example, and I'm hitting out-of-memory errors. Even if I reduce to seq_len = 200 and batch_size = 4 I still run out of memory, and there's not much point going lower than that since training would most likely collapse anyway.

Evidently 4GB will not suffice and I'll need to upgrade. What are people using successfully and with what seq_len and batch_size?

BigBadBurrow avatar May 14 '19 12:05 BigBadBurrow

Hey, maybe this will help. With fp16 support I got past the OOM errors, even with batch_size=32 (GTX 1080, 8GB). https://github.com/thorjohnsen/bert/tree/gpu_optimizations
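
The fork above makes its own changes to the BERT code; just to illustrate the general idea, this is roughly what enabling mixed precision looks like with TensorFlow's built-in API (assuming TF 2.4+; the model here is only a stand-in, not BERT):

```python
# Illustrative only: mixed-precision (fp16) training in TF 2.4+, which roughly
# halves activation memory and often avoids OOM on 8GB cards.
# The tiny model below is a placeholder, not the BERT code from the fork above.
import numpy as np
import tensorflow as tf

tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(768, activation="gelu"),
    # Keep the final layer in float32 for numerical stability of the loss.
    tf.keras.layers.Dense(2, dtype="float32"),
])

optimizer = tf.keras.optimizers.Adam(1e-5)
# LossScaleOptimizer scales the loss so fp16 gradients don't underflow.
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(optimizer)

model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# Dummy data just so the snippet runs end to end.
x = np.random.randn(32, 768).astype("float32")
y = np.random.randint(0, 2, size=(32,))
model.fit(x, y, batch_size=8, epochs=1)
```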

AndreasFdev avatar May 16 '19 13:05 AndreasFdev

Thanks @AndreasFdev, I concluded there was no way I'd be able to do training with a 4GB GPU, so I managed to lay my hands on a second-hand Titan X with 12GB - working fine now.

BigBadBurrow avatar May 23 '19 15:05 BigBadBurrow

@BigBadBurrow What batch size & float precision did you end up using on the Titan X (12GB)?

elkotito avatar Jan 20 '20 16:01 elkotito

@AndreasFdev How did you implement the fp16 support? Did you use Apex?

YiweiJiang2015 avatar Mar 12 '20 21:03 YiweiJiang2015
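
For anyone else wondering about this: at the time, NVIDIA's Apex library was the usual way to add fp16 to a PyTorch training loop (PyTorch 1.6+ later absorbed this as torch.cuda.amp). A minimal sketch, with a toy model standing in for BERT:

```python
# Illustrative only: fp16 training step with NVIDIA Apex's amp API.
# Requires a CUDA GPU and https://github.com/NVIDIA/apex installed;
# the model and data below are toy placeholders, not BERT.
import torch
import torch.nn as nn
from apex import amp

model = nn.Linear(768, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# opt_level="O1" casts whitelisted ops to fp16 and keeps fp32 master weights.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

features = torch.randn(8, 768).cuda()
labels = torch.randint(0, 2, (8,)).cuda()

optimizer.zero_grad()
loss = nn.functional.cross_entropy(model(features), labels)
# Scale the loss so small fp16 gradients do not underflow to zero.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```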

I have a 15GB GPU, my batch size is 2, and it still crashes.

Foina avatar Jul 22 '20 09:07 Foina

I've tried using different GPUs, and I always end up with an out-of-memory exception as all of the GPU memory is taken up on the following cards:

  1. 1050Ti (4GB)
  2. 2060 Super (8GB)

For the above I had:
  - Operating System: Ubuntu 18.04
  - CUDA: 10
  - cuDNN: 7.6
  - Python: 3.6
  - TensorFlow: 2.3

However, I'm able to run my tests on a 3060 and a 1080Ti. The only things that changed (keeping the rest the same as above) are:
  - CUDA: 11.2
  - cuDNN: 8
  - TensorFlow: 2.6

I also tried switching to ALBERT and DistilBERT, but even then I couldn't compile the model or get through the training epochs in a realistic timeframe.

muhammad-noman-d avatar Sep 27 '21 12:09 muhammad-noman-d
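
Worth noting for the OOM reports above: by default TensorFlow reserves essentially the entire GPU at startup, so "all of the GPU memory is taken up" appears regardless of how much the model actually needs. A small sketch (standard TF 2 API) to allocate memory on demand instead; this helps see real usage, though it won't make a genuinely too-large model fit:

```python
# Illustrative: ask TensorFlow 2.x to allocate GPU memory on demand instead of
# grabbing the whole card at startup. Must be called before any GPU op runs.
import tensorflow as tf

for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```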

@muhammad-noman-d I recommend you try the Transformers library from Hugging Face (https://huggingface.co/docs/transformers/index); their BERT implementation has smart defaults (FP16) and can be combined with DeepSpeed (https://github.com/microsoft/DeepSpeed) to significantly reduce training time and hardware requirements.
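
Roughly, the fp16 route there looks like this; the dataset, batch size, and paths below are just placeholders to show the knobs, not recommendations for any particular task:

```python
# Illustrative sketch of fine-tuning BERT-base with Hugging Face Transformers,
# mixed precision enabled. Uncomment `deepspeed=` only if DeepSpeed is installed
# and a JSON config exists at that (hypothetical) path.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

dataset = load_dataset("glue", "sst2")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bert-sst2",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,   # effective batch size 32 on a single GPU
    fp16=True,                       # mixed precision: roughly halves activation memory
    num_train_epochs=3,
    # deepspeed="ds_config.json",    # optional: ZeRO offloading for further savings
)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"], tokenizer=tokenizer)
trainer.train()
```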

LifeIsStrange avatar Feb 18 '22 23:02 LifeIsStrange