OOM when fine-tuning Falcon-7B
I tried to fine-tune Falcon-7B on 8 V100 32GB GPUs. I already set override_max_seq_length = 200 and tried both 16-mixed and 32-true precision; both hit the OOM error below. Any idea why this happens?
File "/opt/conda/envs/felcon/lib/python3.8/site-packages/lightning/fabric/strategies/strategy.py", line 179, in backward self.precision.backward(tensor, module, *args, **kwargs) File "/opt/conda/envs/felcon/lib/python3.8/site-packages/lightning/fabric/plugins/precision/precision.py", line 89, in backward tensor.backward(*args, **kwargs) File "/opt/conda/envs/felcon/lib/python3.8/site-packages/torch/_tensor.py", line 491, in backward torch.autograd.backward( File "/opt/conda/envs/felcon/lib/python3.8/site-packages/torch/autograd/init.py", line 204, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 792.00 MiB. GPU 6 has a total capacty of 31.74 GiB of which 573.12 MiB is free. Including non-PyTorch memory, this process has 31.18 GiB memory in use. Of the allocated memory 27.98 GiB is allocated by PyTorch, and 1.74 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Did you try reducing your micro_batch_size? We have a guide for OOMs in https://github.com/Lightning-AI/lit-gpt/blob/main/howto/oom.md
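For context, lowering micro_batch_size reduces peak activation memory without changing the effective batch size, because the finetune scripts compensate with gradient accumulation. A rough sketch of that relationship (names follow the adapter.py snippet quoted further down; batch_size stands in for the script's global batch size hyperparameter and the value is illustrative):

batch_size = 64        # global batch size hyperparameter (illustrative value)
micro_batch_size = 1   # samples pushed through the model per forward/backward; lower this first on OOM
gradient_accumulation_iters = batch_size // micro_batch_size
# The training loop calls fabric.backward(loss / gradient_accumulation_iters) every iteration
# but only steps the optimizer once per gradient_accumulation_iters iterations,
# so gradients are still averaged over batch_size samples either way.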
Running adapter.py with current main, falcon-7b, precision=16-true, micro_batch_size=1 should use 22.69 GB max allocated memory.
Is it possible that the model requires a lot of memory when it starts up? I am running into similar memory issues for Falcon-7B with LoRA and adapters, running on an A10G (24 GB). I tried all the recommendations in the OOM README. How accurate is the peak memory calculator, given that the benchmarks seem to be on A100s with a lot more capacity?
Just the 7B model (no training, etc.) requires 29 GB with mixed precision and 14.5 GB with true half precision. See the math in https://github.com/Lightning-AI/lit-gpt/issues/159#issuecomment-1599820686
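To reproduce those estimates, the arithmetic is just parameter count times bytes per weight, counting weights only (activations, gradients and optimizer state come on top). Using the falcon-7b parameter count printed in the log further down:

n_params = 7_216_889_856  # falcon-7b non-trainable parameter count from the log below
print(f"fp32 weights (32-true, or the fp32 weights kept by 16-mixed): {n_params * 4 / 1e9:.1f} GB")  # ~28.9 GB
print(f"fp16 weights (16-true): {n_params * 2 / 1e9:.1f} GB")                                        # ~14.4 GB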
I still get an OOM using current main. Here is my setup, @carmocca:
Loading model 'checkpoints/tiiuae/falcon-7b/lit_model.pth' with {'block_size': 1024, 'vocab_size': 50254, 'padding_multiple': 512, 'padded_vocab_size': 65024, 'n_layer': 32, 'n_head': 71, 'n_embd': 4544, 'rotary_percentage': 1.0, 'parallel_residual': True, 'bias': False, 'n_query_groups': 1, 'shared_attention_norm': True, 'adapter_prompt_length': 10, 'adapter_start_layer': 2}
Number of trainable parameters: 3839186
Number of non trainable parameters: 7216889856
Estimated TFLOPs: 369.44
Measured TFLOPs: 355.28
input_ids shape: torch.Size([1, 577])
input_ids shape: torch.Size([1, 782])
input_ids shape: torch.Size([1, 507])
input_ids shape: torch.Size([1, 185])
input_ids shape: torch.Size([1, 415])
input_ids shape: torch.Size([1, 399])
input_ids shape: torch.Size([1, 454])
input_ids shape: torch.Size([1, 179])
Hi @carmocca, I've tried the recommended changes and adapter.py works fine for the first epoch, but after the first epoch completes the loss automatically changes to NaN. I also observed that with 16-true and micro_batch_size = 1 the model took 40 GB on an A100 to load.
I think the potential issue lies in the train function in adapter.py, around line 174:
156 for iter_num in range(max_iters):
157 if step_count <= warmup_iters:
158 # linear warmup
159 lr = learning_rate * step_count / warmup_iters
160 for param_group in optimizer.param_groups:
161 param_group["lr"] = lr
162
163 iter_t0 = time.time()
164
165 input_ids, targets = get_batch(
166 fabric, train_data, longest_seq_length, longest_seq_ix if iter_num == 0 else None
167 )
168
169 is_accumulating = (iter_num + 1) % gradient_accumulation_iters != 0
170 with fabric.no_backward_sync(model, enabled=is_accumulating):
171 logits = model(input_ids, max_seq_length=max_seq_length, lm_head_chunk_size=128)
172 # shift the targets such that output n predicts token n+1
173 logits[-1] = logits[-1][..., :-1, :]
174 loss = chunked_cross_entropy(logits, targets[..., 1:])
175 fabric.backward(loss / gradient_accumulation_iters)
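For readers unfamiliar with the lm_head_chunk_size / chunked_cross_entropy pair on lines 171 and 174: the idea is to avoid materializing the full (seq_len, vocab_size) logits tensor for the loss at once. Below is a minimal conceptual sketch, not the actual lit-gpt implementation; the function name and the ignore_index value are illustrative:

import torch
import torch.nn.functional as F

def chunked_cross_entropy_sketch(logit_chunks, targets, ignore_index=-1):
    # logit_chunks: list of (B, chunk_len, vocab) tensors from the chunked lm_head;
    # targets: (B, T) with padded positions marked by ignore_index.
    target_chunks = torch.split(targets, [c.size(1) for c in logit_chunks], dim=1)
    # Summing per-chunk losses and dividing by the number of real tokens matches a
    # single mean cross entropy, but never builds the full logits tensor at once.
    loss_sum = sum(
        F.cross_entropy(c.reshape(-1, c.size(-1)), t.reshape(-1),
                        ignore_index=ignore_index, reduction="sum")
        for c, t in zip(logit_chunks, target_chunks)
    )
    n_valid = (targets != ignore_index).sum().clamp(min=1)
    return loss_sum / n_valid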
Same issue here, did you solve it?
NaNs are likely to occur with 16-true precision: https://github.com/Lightning-AI/lit-gpt/issues/291#issuecomment-1645396074
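If it helps to narrow down when this starts, a small guard like the following (a sketch, reusing the variable names from the snippet above) can be dropped in right after the loss is computed to catch the first non-finite iteration:

# Sketch: flag the first iteration where the loss becomes non-finite,
# placed right after `loss = chunked_cross_entropy(...)` in the loop above.
if not torch.isfinite(loss):
    fabric.print(f"Non-finite loss ({loss.item()}) at iter {iter_num}, step {step_count}")
    break

On Ampere-class GPUs such as the A100, bf16-true precision may also be worth trying, since bfloat16 has a much wider dynamic range than fp16, though whether it resolves this particular NaN is worth verifying.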