Getting OOM Error When Finetuning the Falcon-7b Model on an 80GB A100 GPU with Custom Data.
Yes, that's true. I had the same problem. What is weird, however, is that you can run the Alpaca 52k dataset without any problem.
Had the same issue today; tried to run finetune/adapter_v2 for Falcon 7B on an NVIDIA L4 (24 GB VRAM)
Loading model 'checkpoints/tiiuae/falcon-7b/lit_model.pth' with {'org': 'tiiuae', 'name': 'falcon-7b', 'block_size': 2048, 'vocab_size': 50254, 'padding_multiple': 512, 'padded_vocab_size': 65024, 'n_layer': 32, 'n_head': 71, 'n_embd': 4544, 'rotary_percentage': 1.0, 'parallel_residual': True, 'bias': False, 'n_query_groups': 1, 'shared_attention_norm': True, '_norm_class': 'LayerNorm', 'norm_eps': 1e-05, '_mlp_class': 'GptNeoxMLP', 'intermediate_size': 18176, 'condense_ratio': 1, 'adapter_prompt_length': 10, 'adapter_start_layer': 2}
but it exited at the training step:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacty of 21.96 GiB of which 70.88 MiB is free. Including non-PyTorch memory, this process has 21.88 GiB memory in use. Of the allocated memory 21.62 GiB is allocated by PyTorch, and 58.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
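As the error message itself suggests, allocator fragmentation can sometimes be mitigated by capping the split size. A minimal sketch follows; the 128 MiB value is an arbitrary starting point to tune, not a recommendation from this thread. The variable must be set before torch initializes CUDA, so set it at the very top of the script (or export it in your shell before launching):

```python
# Sketch: configure the CUDA caching allocator before torch touches the GPU.
# max_split_size_mb:128 is an assumed starting value; tune it for your workload.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import *after* setting the env var so the allocator picks it up
```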
Indeed, no luck with 24 GB of VRAM. Some big datasets even failed on an 80 GB machine; I was only successful with Alpaca 52k on 80 GB. Two options: reduce the dataset size, or reduce the number of epochs.
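For the first option, here is a hypothetical sketch of trimming a custom instruction dataset before running the prepare script. The path, field names, and record count are placeholders, assuming your raw data is a JSON list of instruction records as the Alpaca-style prepare scripts expect:

```python
# Hypothetical sketch: shrink a custom instruction dataset before preparation.
# File paths and field names are assumptions, not taken from this thread.
import json
from pathlib import Path

src = Path("data/custom/full.json")  # assumed location of the raw dataset
records = json.loads(src.read_text())

# Keep the shortest samples: very long sequences are the usual reason custom
# data OOMs where Alpaca does not, since peak memory scales with sequence length.
records.sort(key=lambda r: len(r.get("instruction", "") + r.get("output", "")))
subset = records[:10_000]

Path("data/custom/small.json").write_text(json.dumps(subset, indent=2))
print(f"kept {len(subset)} of {len(records)} records")
```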
The other approach is QLoRA, which supports a 4-bit data type. Unfortunately, Lit-Parrot does not support it.
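For reference, 4-bit loading of this kind of model in the Hugging Face stack (transformers + bitsandbytes) looks roughly like the sketch below. This is outside lit-gpt, and the exact flags may differ across library versions:

```python
# Sketch of QLoRA-style 4-bit loading via transformers + bitsandbytes.
# This is *not* lit-gpt; flag names follow transformers' BitsAndBytesConfig.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,         # also quantize the quant constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
)

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,  # Falcon shipped custom modeling code at the time
)
```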
I hope to have time to write a detailed tutorial about this in a few days.
Thanh
You can try the suggestions described in https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/oom.md
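The levers that guide recommends boil down to a few settings. A hedged sketch of where they typically live follows; the variable names follow the lit-gpt finetune scripts of that era and may differ in your checkout, so treat them as assumptions:

```python
# Hedged sketch of common OOM knobs in a lit-gpt finetune script; verify the
# exact variable names against your version before relying on them.
micro_batch_size = 1            # fewer samples per forward/backward pass
override_max_seq_length = 512   # cap long custom-data samples (if supported)

# Lower-precision training via Lightning Fabric also cuts memory substantially:
import lightning as L
fabric = L.Fabric(devices=1, precision="bf16-true")
```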