
Getting OOM Error when Finetuning Falcon-7b Model on an 80 GB A100 GPU with Custom Data.


krishna0125 avatar Jul 07 '23 05:07 krishna0125

Yes, that's true. I had the same problem. It is weird, though, that you can run the Alpaca 52k dataset without any problem.

thanhnew2001 avatar Jul 08 '23 09:07 thanhnew2001

Had the same issue today; tried to run finetune/adapter_v2 for Falcon 7B on an NVIDIA L4 (24 GB VRAM)

Loading model 'checkpoints/tiiuae/falcon-7b/lit_model.pth' with {'org': 'tiiuae', 'name': 'falcon-7b', 'block_size': 2048, 'vocab_size': 50254, 'padding_multiple': 512, 'padded_vocab_size': 65024, 'n_layer': 32, 'n_head': 71, 'n_embd': 4544, 'rotary_percentage': 1.0, 'parallel_residual': True, 'bias': False, 'n_query_groups': 1, 'shared_attention_norm': True, '_norm_class': 'LayerNorm', 'norm_eps': 1e-05, '_mlp_class': 'GptNeoxMLP', 'intermediate_size': 18176, 'condense_ratio': 1, 'adapter_prompt_length': 10, 'adapter_start_layer': 2}

but it exited at the training step:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacty of 21.96 GiB of which 70.88 MiB is free. Including non-PyTorch memory, this process has 21.88 GiB memory in use. Of the allocated memory 21.62 GiB is allocated by PyTorch, and 58.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
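For what it's worth, the last sentence of that error is actionable. A minimal sketch of acting on it, assuming you set the variable before the first CUDA allocation; the value 128 is illustrative, not a tested recommendation:

```python
import os

# Cap the caching allocator's split size to reduce fragmentation, as the
# error message suggests. This must run before any CUDA tensor is created
# (e.g., before the model is loaded); 128 MiB is just an example value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```

Equivalently, export `PYTORCH_CUDA_ALLOC_CONF` in the shell before launching the finetuning script. Note this only mitigates fragmentation; it won't help if the model and activations genuinely exceed 24 GB.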

keurcien avatar Jul 09 '23 08:07 keurcien

Indeed, no luck with 24 GB of VRAM. Some big datasets even failed on an 80 GB machine, though I was successful with Alpaca 52k on an 80 GB machine. Two options: reduce the dataset size, or reduce the number of epochs.
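A minimal sketch of the first option, assuming an Alpaca-style JSON instruction file; the paths and the 5,000-sample cutoff are placeholders, not recommendations:

```python
import json

# Hypothetical paths -- point these at your own instruction dataset.
with open("data/my_dataset/train.json") as f:
    samples = json.load(f)

# Keep a subset before running the prepare/finetune scripts, to test whether
# the dataset size (or a handful of very long samples) is driving the OOM.
with open("data/my_dataset/train_small.json", "w") as f:
    json.dump(samples[:5000], f)
```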

The other approach is to try QLoRA, which supports a 4-bit data type. Unfortunately, lit-parrot does not support it.
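For reference, this is roughly what QLoRA-style 4-bit loading looks like outside lit-parrot, using Hugging Face transformers with bitsandbytes. A sketch only, not something the lit-parrot scripts accept:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA-style 4-bit (NF4) quantization config from bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,  # Falcon shipped custom modeling code at the time
)
```

Loading in 4-bit cuts the frozen weights to roughly a quarter of their fp16 footprint, which is what makes 7B finetuning feasible on 24 GB cards.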

I hope to have time to write a detailed tutorial about this in a few days.

Thanh


thanhnew2001 avatar Jul 09 '23 09:07 thanhnew2001

You can try the suggestions described in https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/oom.md
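Whichever of those suggestions you try, one quick way to confirm that a change actually lowered peak usage is to log it after a few training iterations:

```python
import torch

# Reset the counter at the start of a run, then read the peak after a few
# training steps to compare settings (micro batch size, precision, etc.).
torch.cuda.reset_peak_memory_stats()
# ... run some training iterations ...
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak memory allocated: {peak_gib:.2f} GiB")
```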

carmocca avatar Jul 12 '23 14:07 carmocca