
OOM when fine-tuning Falcon-7B

Open MasterEndless opened this issue 1 year ago • 5 comments

I tried to fine-tune Falcon-7B on 8 V100 32 GB GPUs, and I already set override_max_seq_length = 200. I tried both 16-mixed and 32-true precision, and both hit an OOM error. Any idea why this happens?

File "/opt/conda/envs/felcon/lib/python3.8/site-packages/lightning/fabric/strategies/strategy.py", line 179, in backward self.precision.backward(tensor, module, *args, **kwargs) File "/opt/conda/envs/felcon/lib/python3.8/site-packages/lightning/fabric/plugins/precision/precision.py", line 89, in backward tensor.backward(*args, **kwargs) File "/opt/conda/envs/felcon/lib/python3.8/site-packages/torch/_tensor.py", line 491, in backward torch.autograd.backward( File "/opt/conda/envs/felcon/lib/python3.8/site-packages/torch/autograd/init.py", line 204, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 792.00 MiB. GPU 6 has a total capacty of 31.74 GiB of which 573.12 MiB is free. Including non-PyTorch memory, this process has 31.18 GiB memory in use. Of the allocated memory 27.98 GiB is allocated by PyTorch, and 1.74 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

MasterEndless avatar Jun 22 '23 18:06 MasterEndless

Did you try reducing your micro_batch_size? We have a guide for OOMs in https://github.com/Lightning-AI/lit-gpt/blob/main/howto/oom.md
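
For context, in the fine-tuning scripts these are plain module-level constants, so lowering the micro-batch only changes how many gradient-accumulation steps make up one effective batch, not the effective batch size itself. A rough sketch of the relevant lines (names follow the loop excerpt later in this thread; the values are illustrative, check your checkout):

batch_size = 64                 # effective batch size (illustrative value)
micro_batch_size = 1            # lower this first when you hit CUDA OOM
gradient_accumulation_iters = batch_size // micro_batch_size
assert gradient_accumulation_iters > 0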

Running adapter.py with current main, falcon-7b, precision=16-true, and micro_batch_size=1 should use 22.69 GB of max allocated memory.
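
If you want to check that number on your own run, PyTorch's built-in counters are enough; a minimal sketch (plain torch.cuda API, not part of the lit-gpt scripts):

import torch

torch.cuda.reset_peak_memory_stats()
# ... run a few training iterations here ...
print(f"max allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")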

carmocca avatar Jun 22 '23 23:06 carmocca

Is it possible that the model requires a lot of memory at startup? I am running into similar memory issues for Falcon-7B with LoRA and adapters, running on an A10G (24 GB). I tried all the recommendations in the OOM README. How accurate is the peak memory calculator, given that the benchmarks seem to be on A100s with much more capacity?

arunbg avatar Jun 23 '23 00:06 arunbg

Just the 7B model (no training, etc.) requires 29 GB with mixed precision and 14.5 GB with true half precision. See the math in https://github.com/Lightning-AI/lit-gpt/issues/159#issuecomment-1599820686
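
For reference, the arithmetic behind those two numbers, sketched with the parameter count that appears in the log later in this thread (weights only, before activations, gradients, and optimizer state):

n_params = 7_216_889_856 + 3_839_186   # Falcon-7B + adapter params, from the log below

print(f"fp32 weights (16-mixed keeps a full-precision copy): {n_params * 4 / 1e9:.1f} GB")
print(f"fp16/bf16 weights (16-true): {n_params * 2 / 1e9:.1f} GB")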

carmocca avatar Jun 23 '23 01:06 carmocca

I still get an OOM using current main. Here are my settings, @carmocca:

Loading model 'checkpoints/tiiuae/falcon-7b/lit_model.pth' with {'block_size': 1024, 'vocab_size': 50254, 'padding_multiple': 512, 'padded_vocab_size': 65024, 'n_layer': 32, 'n_head': 71, 'n_embd': 4544, 'rotary_percentage': 1.0, 'parallel_residual': True, 'bias': False, 'n_query_groups': 1, 'shared_attention_norm': True, 'adapter_prompt_length': 10, 'adapter_start_layer': 2}

Number of trainable parameters: 3839186
Number of non trainable parameters: 7216889856
Estimated TFLOPs: 369.44
Measured TFLOPs: 355.28
input_ids shape: torch.Size([1, 577])
input_ids shape: torch.Size([1, 782])
input_ids shape: torch.Size([1, 507])
input_ids shape: torch.Size([1, 185])
input_ids shape: torch.Size([1, 415])
input_ids shape: torch.Size([1, 399])
input_ids shape: torch.Size([1, 454])
input_ids shape: torch.Size([1, 179])

MasterEndless avatar Jun 28 '23 17:06 MasterEndless

Hi @carmocca, I've tried the recommended changes. adapter.py works fine for the first epoch, but after the first epoch completes the loss changes to NaN.

(screenshot of the training log showing the loss becoming NaN)

Also, I observed that with 16-true and micro_batch_size = 1 the model took 40 GB on an A100 to load.

I think the potential issue lies in the train function in adapter.py, around line 174:

156     for iter_num in range(max_iters):
157         if step_count <= warmup_iters:
158             # linear warmup
159             lr = learning_rate * step_count / warmup_iters
160             for param_group in optimizer.param_groups:
161                 param_group["lr"] = lr
162 
163         iter_t0 = time.time()
164 
165         input_ids, targets = get_batch(
166             fabric, train_data, longest_seq_length, longest_seq_ix if iter_num == 0 else None
167         )
168 
169         is_accumulating = (iter_num + 1) % gradient_accumulation_iters != 0
170         with fabric.no_backward_sync(model, enabled=is_accumulating):
171             logits = model(input_ids, max_seq_length=max_seq_length, lm_head_chunk_size=128)
172             # shift the targets such that output n predicts token n+1
173             logits[-1] = logits[-1][..., :-1, :]
174             loss = chunked_cross_entropy(logits, targets[..., 1:])
175             fabric.backward(loss / gradient_accumulation_iters)
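
For context on that snippet: lm_head_chunk_size=128 makes the model return the LM-head output as a list of chunks, and chunked_cross_entropy then computes the loss chunk by chunk, so the full [batch, seq_len, vocab_size] logits tensor is never materialized at once. A simplified illustration of the idea, not the actual lit_gpt.utils implementation:

import torch
import torch.nn.functional as F

def chunked_ce_sketch(logit_chunks, targets, ignore_index=-1):
    """Cross entropy over a list of [B, T_i, V] logit chunks and [B, T] targets."""
    target_chunks = targets.split([c.size(1) for c in logit_chunks], dim=1)
    loss_sum = sum(
        F.cross_entropy(c.reshape(-1, c.size(-1)), t.reshape(-1),
                        ignore_index=ignore_index, reduction="sum")
        for c, t in zip(logit_chunks, target_chunks)
    )
    n_valid = (targets != ignore_index).sum()
    return loss_sum / n_valid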

someshfengde avatar Jun 29 '23 06:06 someshfengde

(quoting someshfengde's comment above)

Same issue, did you solve it?

MasterEndless avatar Jul 21 '23 06:07 MasterEndless

NaNs are likely to occur with 16-true precision: https://github.com/Lightning-AI/lit-gpt/issues/291#issuecomment-1645396074
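
One common mitigation, where the hardware allows it, is bf16 rather than fp16 in true precision, since bf16 keeps the fp32 exponent range and is far less prone to overflow. A minimal sketch for picking a precision string (standard torch.cuda check; V100s do not support bf16, A100/A10G do):

import torch

# Ampere or newer (A100, A10G, ...) supports bf16; V100 does not.
precision = "bf16-true" if torch.cuda.is_bf16_supported() else "16-mixed"
print(f"using precision={precision}")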

carmocca avatar Jul 21 '23 14:07 carmocca