CUDA OOM with 20B model
I am trying to finetune the 20B model on the APPS dataset, starting from the slim weights. My config is identical to the one you provide in the repository, apart from some tweaks (listed below), but I constantly get an OOM error.
Changes to the configuration (a sketch of the combined overrides follows this list):
- gradient_accumulation_steps: tried different values from 1 to 32
- train_micro_batch_size_per_gpu: kept the same as gradient_accumulation_steps
- zero_optimization: only stage 1 works; CPU offload doesn't. I also tried changing the "reduce_bucket_size" parameter and others accordingly.
- pipe-parallel-size and model-parallel-size: 1x2, 2x2, 4x2; tried different combinations depending on the number of GPUs available
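For concreteness, this is roughly what my overrides on top of the repository's 20B config look like. The keys are the ones named above, in the repo's config style; the concrete values are illustrative (one point out of the sweeps described above), not a combination I claim works:

```yaml
# Sketch of my overrides on top of the repository's 20B config.
# Values are illustrative, one point out of the sweeps described above.
{
  # tried 1x2, 2x2 and 4x2 depending on available GPUs
  "pipe-parallel-size": 2,
  "model-parallel-size": 2,

  # swept 1 to 32, keeping these two equal
  "gradient_accumulation_steps": 4,
  "train_micro_batch_size_per_gpu": 4,

  # only stage 1 ran at all; CPU offload did not work for me
  "zero_optimization": {
    "stage": 1,
    "reduce_bucket_size": 500000000,    # also tried other bucket sizes
    "allgather_bucket_size": 500000000,
    "contiguous_gradients": true,
  },
}
```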
Setups I tried:
- 2/4 x NVIDIA Tesla V100 32 GB NVLink
- 2/4/8 x NVIDIA A100 80 GB SXM (NVLink)
The only setup that worked was 8 x NVIDIA A100 80 GB SXM. Sadly, that run failed because of an unrelated configuration mistake (not relevant here). The problem is that I now have to wait days or weeks to launch the finetuning again: I am using my university cluster, which has only 6 nodes with this configuration, and they are always occupied.
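For context, here is the back-of-envelope arithmetic behind my doubts about the smaller setups. It is a rough sketch assuming standard fp16 mixed-precision Adam at ~16 bytes per parameter (the usual ZeRO-paper estimate), and it ignores activations, buffers, and fragmentation:

```python
# Back-of-envelope memory budget for finetuning a 20B-parameter model.
# Assumption: mixed-precision Adam needs roughly 16 bytes per parameter
# (2 B fp16 weights + 2 B fp16 grads + 12 B fp32 master/momentum/variance),
# before counting activations, buffers, and fragmentation.
params = 20e9
state_gb = params * 16 / 1e9  # ~320 GB of model + optimizer state

for n_gpus, gb_each in [(2, 32), (4, 32), (2, 80), (4, 80), (8, 80)]:
    total = n_gpus * gb_each
    headroom = total - state_gb
    print(f"{n_gpus} x {gb_each} GB = {total:4.0f} GB "
          f"-> headroom {headroom:+.0f} GB before activations")
```

By that estimate, only the 8 x 80 GB setup leaves real headroom for activations, which matches what I observed; the smaller setups would presumably need most of the optimizer state kept off-GPU (ZeRO stage 2/3 with CPU offload), which is exactly the part I could not get working.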
Could you please comment on how to finetune the model properly on 2 x NVIDIA Tesla V100 32 GB NVLink or 2 x NVIDIA A100 80 GB SXM? What should the configuration look like? Is it even possible?