Is it possible to run Llama 2 70B with 80 GB?
I'm trying to finetune Llama 2 70B on an NVIDIA A100 with 80 GB, but even with batch_size = 1 I'm getting an OOM error.
I'm using LoRA with quantization, set up like this: `plugins = BitsandbytesPrecision('nf4-dq', torch.bfloat16)`
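For context, this is roughly how that plugin is wired into Fabric in my script (a minimal sketch; the rest of the setup follows lit-gpt's finetune/lora.py, and the device count is just for illustration):

```python
import torch
from lightning.fabric import Fabric
from lightning.fabric.plugins import BitsandbytesPrecision

# Quantize the frozen pre-trained weights to 4-bit NF4 with double quantization,
# while computing in bfloat16.
plugins = BitsandbytesPrecision("nf4-dq", torch.bfloat16)
fabric = Fabric(devices=1, plugins=plugins)
fabric.launch()
```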
Am I missing something?
I'm also having a similar problem with LLaMA 33B on two NVIDIA A100 80 GB GPUs, even with a micro-batch size of 1. It is really confusing: when I run the LoRA finetuning on a single GPU, it fits in 70 GB of memory, but when I run the same finetuning with the same configuration on two GPUs, it goes OOM.
It seems that the FSDP strategy does not actually shard the pre-trained weights, only the trainable parameters (see https://github.com/pytorch/pytorch/issues/95805).
So I updated torch to a nightly version, updated the lightning library, and fixed the FSDP strategy's auto-wrap policy, following https://huggingface.co/docs/peft/accelerate/fsdp and the reference implementation in https://github.com/meta-llama/llama-recipes/blob/main/src/llama_recipes/utils/fsdp_utils.py:

```python
from functools import partial

import torch
from torch.distributed.fsdp.wrap import _or_policy, lambda_auto_wrap_policy, transformer_auto_wrap_policy
from lightning.fabric.strategies import FSDPStrategy


def fsdp_auto_wrap_policy(block: type[torch.nn.Module]):
    def lambda_policy_fn(module):
        # Wrap leaf modules whose weights are trainable (i.e. the LoRA parameters).
        if (
            len(list(module.named_children())) == 0
            and getattr(module, "weight", None) is not None
            and module.weight.requires_grad
        ):
            return True
        return False

    lambda_policy = partial(lambda_auto_wrap_policy, lambda_fn=lambda_policy_fn)
    # Also wrap every transformer block so the frozen pre-trained weights are sharded too.
    transformer_wrap_policy = partial(transformer_auto_wrap_policy, transformer_layer_cls={block})
    auto_wrap_policy = partial(_or_policy, policies=[lambda_policy, transformer_wrap_policy])
    return auto_wrap_policy


# Block is lit-gpt's transformer block class (already imported in finetune/lora.py).
auto_wrap_policy = fsdp_auto_wrap_policy(Block)
strategy = FSDPStrategy(auto_wrap_policy=auto_wrap_policy, state_dict_type="full",
                        limit_all_gathers=True, cpu_offload=False)
```
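For completeness, the strategy is then handed to Fabric for the 2-GPU run roughly like this (a sketch of my setup; the precision argument is an assumption):

```python
from lightning.fabric import Fabric

# Launch Fabric with the FSDP strategy defined above on both GPUs.
fabric = Fabric(devices=2, strategy=strategy, precision="bf16-true")
fabric.launch()
```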
Then the pre-trained model seems to be sharded across the 2 GPUs, but when training starts, I still get OOM.
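A quick way to see where the memory goes on each rank (a rough sketch; `report_memory` is just a name I made up, called for example right after `fabric.setup(model)` and again after the first optimizer step):

```python
import torch


def report_memory(fabric, tag: str) -> None:
    # Per-rank CUDA memory snapshot; useful to check whether the shards are actually balanced.
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f"[rank {fabric.global_rank}] {tag}: "
          f"allocated={allocated:.1f} GB, reserved={reserved:.1f} GB")
```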
Although the solution above didn't work for me, I hope it might work for you.
Plus, I would like to ask: (1) can the code in lit-gpt/finetune/lora.py support LoRA fine-tuning with sharded 'pre-trained' model parameters, not only sharded 'LoRA' parameters? (2) If not, is there any method or workaround to support this, e.g. naive pipeline parallelism (see the toy sketch below)? It seems many users are running into the same problem, yet no definitive solution has been suggested.
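To make (2) concrete, this is the kind of thing I mean by naive pipeline parallelism: a toy sketch that keeps half of the layers on each GPU and moves activations between them, not something lit-gpt currently provides:

```python
import torch.nn as nn


class TwoStagePipeline(nn.Module):
    """Toy sketch of naive pipeline parallelism: half the layers on each of two GPUs."""

    def __init__(self, stage0: nn.Module, stage1: nn.Module):
        super().__init__()
        self.stage0 = stage0.to("cuda:0")
        self.stage1 = stage1.to("cuda:1")

    def forward(self, x):
        # Move activations across the device boundary between the two stages.
        x = self.stage0(x.to("cuda:0"))
        return self.stage1(x.to("cuda:1"))
```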
Thanks for the very thorough comment and explanation, and thanks for sending the improved FSDP code along. I remember @awaelchli also looking into something FSDP-related, and maybe that's relevant here.
I think one workaround would be setting cpu_offload=True, but that would obviously slow things down a lot.
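For reference, that change would look roughly like this with the FSDP strategy from above (if I remember correctly, `cpu_offload` also accepts a torch `CPUOffload` instance):

```python
from lightning.fabric.strategies import FSDPStrategy

# Offload sharded parameters to CPU between uses; saves GPU memory at the cost of
# host<->device transfers on every forward/backward pass.
strategy = FSDPStrategy(auto_wrap_policy=auto_wrap_policy, state_dict_type="full",
                        limit_all_gathers=True, cpu_offload=True)
```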
I think in your case, changing the micro-batch size and context length is also not an option because, like you said, it already works on a single GPU. (And you also mention that you are already using the smallest micro-batch size.)
I also tried to finetune Mixtral 8x7B on an A100 GPU and got OOM even with batch_size=1 and micro_batch_size=1. I'm using LoRA.