CUDA OOM with 20B model
I am trying to finetune the 20B model on the APPS dataset, starting from the slim weights. My config is identical to the one you provide in the repository, apart from some tweaks (listed below), but I constantly get an OOM error.
Changes to the configuration (a sketch of the combined overrides follows this list):
- gradient_accumulation_steps: tried different values from 1 to 32
- train_micro_batch_size_per_gpu: kept the same as gradient_accumulation_steps
- zero_optimization: only stage 1 works; CPU offload doesn't. I also tried changing the "reduce_bucket_size" parameter and others accordingly.
- pipe-parallel-size and model-parallel-size: 1x2, 2x2, 4x2; tried different combinations depending on the number of GPUs available
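For concreteness, this is roughly what my overrides on top of the repository's 20B config look like. The keys are the ones named above, in the repo's config style; the concrete values are illustrative (one point out of the sweeps described above), not a combination I claim works:

```yaml
# Sketch of my overrides on top of the repository's 20B config.
# Values are illustrative, one point out of the sweeps described above.
{
  # tried 1x2, 2x2 and 4x2 depending on available GPUs
  "pipe-parallel-size": 2,
  "model-parallel-size": 2,

  # swept 1 to 32, keeping these two equal
  "gradient_accumulation_steps": 4,
  "train_micro_batch_size_per_gpu": 4,

  # only stage 1 ran at all; CPU offload did not work for me
  "zero_optimization": {
    "stage": 1,
    "reduce_bucket_size": 500000000,    # also tried other bucket sizes
    "allgather_bucket_size": 500000000,
    "contiguous_gradients": true,
  },
}
```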
Setups I tried:
- 2/4 x NVIDIA Tesla V100 32 GB NVLink
- 2/4/8 x NVIDIA A100 80 GB SXM (NVLink)
The only setup that worked was 8 x NVIDIA A100 80 GB SXM. Sadly, that run failed because of an unrelated configuration mistake (not relevant here). The problem is that I now have to wait days or weeks to launch the finetuning again: I am using my university cluster, which has only 6 nodes with this configuration, and they are always occupied.
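For context, here is the back-of-envelope arithmetic behind my doubts about the smaller setups. It is a rough sketch assuming standard fp16 mixed-precision Adam at ~16 bytes per parameter (the usual ZeRO-paper estimate), and it ignores activations, buffers, and fragmentation:

```python
# Back-of-envelope memory budget for finetuning a 20B-parameter model.
# Assumption: mixed-precision Adam needs roughly 16 bytes per parameter
# (2 B fp16 weights + 2 B fp16 grads + 12 B fp32 master/momentum/variance),
# before counting activations, buffers, and fragmentation.
params = 20e9
state_gb = params * 16 / 1e9  # ~320 GB of model + optimizer state

for n_gpus, gb_each in [(2, 32), (4, 32), (2, 80), (4, 80), (8, 80)]:
    total = n_gpus * gb_each
    headroom = total - state_gb
    print(f"{n_gpus} x {gb_each} GB = {total:4.0f} GB "
          f"-> headroom {headroom:+.0f} GB before activations")
```

By that estimate, only the 8 x 80 GB setup leaves real headroom for activations, which matches what I observed; the smaller setups would presumably need most of the optimizer state kept off-GPU (ZeRO stage 2/3 with CPU offload), which is exactly the part I could not get working.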
Could you please comment on how to finetune the model properly on 2 x NVIDIA Tesla V100 32 GB NVLink or 2 x NVIDIA A100 80 GB SXM? What should the configuration look like? Is it even possible?