full finetuning of LLaMA 7B: OOM on A100
Thanks for the great work.
I am trying to fully fine-tune LLaMA 7B on Alpaca using the script in finetune/full.py. I am using 8 A100 GPUs.
According to the documentation, fine-tuning with the following parameters is supposed to take 36 hrs even on 4 GPUs, so with 8 GPUs it should, if anything, need less memory per device:
```python
devices = 4
batch_size = 128 // devices
micro_batch_size = 4
```
I even tried reducing micro_batch_size to 1, but it still runs out of memory. Is there another setting I should be adjusting that I'm not aware of?
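For context, here is a minimal sketch of how I understand these values to interact (the `gradient_accumulation_iters` name is my assumption about the script, not verified against the repo):

```python
# Sketch (assumed names): the effective batch is reached by accumulating
# gradients over several micro-batches per optimizer step.
devices = 4
batch_size = 128 // devices  # per-device batch size: 32
micro_batch_size = 4         # samples per forward/backward pass
gradient_accumulation_iters = batch_size // micro_batch_size  # 8 steps

# Lowering micro_batch_size only shrinks activation memory per step;
# the full-precision weights, gradients, and optimizer states for all
# 7B parameters still have to fit, which is why micro_batch_size = 1
# may not be enough to avoid OOM in full fine-tuning.
```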
Hm, I definitely remember training it ... could you try the following and see if it works?
```python
micro_batch_size = 2
```

or

```python
micro_batch_size = 1
```
I'm also getting OOM errors now. I did this about 4 weeks ago with the same parameters and it worked fine.
I also don't have problems fine-tuning with LoRA and adapters.
I also ran into this problem with 4 A100s, even with a small batch size.
I set the cpu_offload option to True (https://lightning.ai/docs/pytorch/2.0.0/_modules/lightning/pytorch/strategies/fsdp.html) for the FSDP strategy, and training was able to continue. But I am not sure how long it will take or whether performance will be affected.
"strategy = FSDPStrategy(auto_wrap_policy=auto_wrap_policy, activation_checkpointing=Block, cpu_offload=True)"
@Mehrnoom Hi, did you solve this problem?