full finetuning of LLaMA 7B: OOM on A100
Thanks for the great work.
I am trying to fully fine-tune LLaMA 7B on Alpaca using the script in finetune/full.py. I am using 8 A100 GPUs.
According to the documentation, fine-tuning with the following parameters is supposed to take 36 hrs even on 4 GPUs, so with 8 GPUs it should, if anything, need less memory per device:
```python
devices = 4
batch_size = 128 // devices
micro_batch_size = 4
```
I even tried reducing micro_batch_size to 1, but it still runs out of memory. Is there another setting I should be adjusting that I'm not aware of?
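For context, here is a minimal sketch of how I understand these values to interact (the `gradient_accumulation_iters` name is my assumption about the script, not verified against the repo):

```python
# Sketch (assumed names): the effective batch is reached by accumulating
# gradients over several micro-batches per optimizer step.
devices = 4
batch_size = 128 // devices  # per-device batch size: 32
micro_batch_size = 4         # samples per forward/backward pass
gradient_accumulation_iters = batch_size // micro_batch_size  # 8 steps

# Lowering micro_batch_size only shrinks activation memory per step;
# the full-precision weights, gradients, and optimizer states for all
# 7B parameters still have to fit, which is why micro_batch_size = 1
# may not be enough to avoid OOM in full fine-tuning.
```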
Hm, I definitely remember training it ... could you try the following and see if it works?
```python
micro_batch_size = 2
```

or

```python
micro_batch_size = 1
```
I'm also getting OOM errors now. I did this about 4 weeks ago with the same parameters and it worked fine.
I also don't have problems fine-tuning with LoRA and adapters.
I also ran into this problem with 4 A100s, even with a small batch size.
I set the cpu_offload option to True (https://lightning.ai/docs/pytorch/2.0.0/_modules/lightning/pytorch/strategies/fsdp.html) for the FSDP strategy, and training was able to continue. But I am not sure how long it will take or whether performance will be affected.
"strategy = FSDPStrategy(auto_wrap_policy=auto_wrap_policy, activation_checkpointing=Block, cpu_offload=True)"
@Mehrnoom Hi, did you solve this problem?