CUDA OOM when fine-tuning 13B
I was able to fine-tune the 7B model on a single A100-40G GPU but ran into OOM when fine-tuning 13B.
Here is the error message:
│ /opt/conda/lib/python3.7/site-packages/bitsandbytes/nn/modules.py:268 in _save_to_state_dict │
│ │
│ 265 │ │ │
│ 266 │ │ try: │
│ 267 │ │ │ if reorder_layout: │
│ ❱ 268 │ │ │ │ self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices) │
│ 269 │ │ │ │
│ 270 │ │ │ super()._save_to_state_dict(destination, prefix, keep_vars) │
│ 271 │
│ │
│ /opt/conda/lib/python3.7/site-packages/bitsandbytes/autograd/_functions.py:100 in undo_layout │
│ │
│ 97 │ outputs[tile_indices.flatten()] = tensor │
│ 98 │ outputs = outputs.reshape(tile_rows, tile_cols, cols // tile_cols, rows // tile_rows │
│ 99 │ outputs = outputs.permute(3, 0, 2, 1) # (rows // tile_rows, tile_rows), (cols // ti │
│ ❱ 100 │ return outputs.reshape(rows, cols).contiguous() │
│ 101 │
│ 102 │
│ 103 class MatMul8bit(torch.autograd.Function): │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryError: CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 39.41 GiB total capacity; 35.83 GiB already allocated; 34.50 MiB free; 38.17 GiB reserved in
total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and
PYTORCH_CUDA_ALLOC_CONF
My command is:
python finetune.py \
--base_model='decapoda-research/llama-13b-hf' \
--num_epochs=5 \
--cutoff_len=512 \
--group_by_length \
--output_dir='./alpaca-lora-saved-model-13b' \
--lora_target_modules='[q_proj,k_proj,v_proj,o_proj]' \
--lora_r=16 \
--micro_batch_size=8
I tried a couple of times, e.g., with a smaller cutoff_len, but still got the same OOM error. One thing I noticed is that the issue happens after training ~10% of the steps. Any thoughts or help would be greatly appreciated.
Running into the same issue: getting OOM after 7-10% while running on 4x A100-40GB.
I started at --micro_batch_size=24 and have been reducing it down to 8, and it still OOMs at around 10%.
Running with
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 finetune.py \
--base_model="yahma/llama-13b-hf" \
--num_epochs=5 \
--cutoff_len=512 \
--data_path="dataset1.json" \
--output_dir='./alpaca-lora-saved-model-13b' \
--lora_target_modules='[q_proj,k_proj,v_proj,o_proj]' \
--lora_r=16 \
--micro_batch_size=8
Any ideas?
Error is
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 39.44 GiB total capacity; 36.06 GiB already allocated; 19.88 MiB free; 37.97 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management
Should I play around with max_split_size_mb, or should I look in another direction?
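(For reference, my understanding is that max_split_size_mb is passed through the PYTORCH_CUDA_ALLOC_CONF environment variable before launching the script; 128 below is just an arbitrary value to try, not a recommendation.)

```bash
# Ask PyTorch's caching allocator to split large cached blocks, which can reduce
# fragmentation. Must be set in the environment before the training process starts.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

# ...then launch finetune.py exactly as before, e.g.
# CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 finetune.py ...
```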
Dataset is around 60% larger than the latest alpaca_cleaned.
Another thing I noticed is that the ETA for micro_batch_size=24 is almost the same as for micro_batch_size=8.
I retrained 7B without any issues. For 13B, I tried a couple of things, but to no avail:
- Use a smaller `cutoff_len = 256`
- Use a smaller `batch_size = 64`
Here is the wandb link to a job: https://wandb.ai/zhengfei-hit/huggingface/runs/6mt3ilc0/overview?workspace=user-zhengfei-hit.
The job used only 45% GPU memory before OOM.
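A simple way to check whether memory actually spikes right before the OOM (rather than creeping up during normal steps) is to poll nvidia-smi in a second shell while the job runs; these are standard nvidia-smi flags, nothing specific to this repo:

```bash
# Print per-GPU memory usage every 5 seconds while training runs elsewhere
nvidia-smi --query-gpu=timestamp,index,memory.used,memory.total --format=csv -l 5
```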
I tried setting max_split_size_mb to 128 MB and 64 MB. It still didn't help; it errors out at 10%, when I think it is checkpointing or something.
Yes
> I retrained 7b without any issues. For 13B, I tried a couple of things but to no avail:
> 1. Use a smaller `cutoff_len = 256`
> 2. Use a smaller `batch_size = 64`
> Here is the wandb link to a job: https://wandb.ai/zhengfei-hit/huggingface/runs/6mt3ilc0/overview?workspace=user-zhengfei-hit.
> The job used only 45% GPU memory before OOM.
It errored out at 10% after doing a checkpoint? (I think)
It usually errors out when it reaches 200 iterations. @tloen What do you think?
I rented 8x RTX 3090s and am getting the same issue there. At 10%, or 200 iterations, it errors out.
Always at 200 iterations...
{'loss': 1.5662, 'learning_rate': 2.3999999999999997e-05, 'epoch': 0.02}
{'loss': 1.521, 'learning_rate': 5.399999999999999e-05, 'epoch': 0.04}
{'loss': 1.3948, 'learning_rate': 8.4e-05, 'epoch': 0.06}
{'loss': 1.1799, 'learning_rate': 0.00011399999999999999, 'epoch': 0.08}
{'loss': 1.079, 'learning_rate': 0.00014399999999999998, 'epoch': 0.1}
{'loss': 1.0344, 'learning_rate': 0.00017399999999999997, 'epoch': 0.12}
{'loss': 1.0017, 'learning_rate': 0.000204, 'epoch': 0.14}
{'loss': 0.9883, 'learning_rate': 0.000234, 'epoch': 0.16}
{'loss': 0.9856, 'learning_rate': 0.00026399999999999997, 'epoch': 0.18}
{'loss': 0.968, 'learning_rate': 0.000294, 'epoch': 0.2}
{'loss': 0.9682, 'learning_rate': 0.00029830747531734835, 'epoch': 0.22}
{'loss': 0.965, 'learning_rate': 0.0002961918194640338, 'epoch': 0.24}
{'loss': 0.9425, 'learning_rate': 0.0002940761636107193, 'epoch': 0.26}
{'loss': 0.9679, 'learning_rate': 0.00029196050775740477, 'epoch': 0.28}
{'loss': 0.9681, 'learning_rate': 0.0002898448519040903, 'epoch': 0.3}
{'loss': 0.9561, 'learning_rate': 0.0002877291960507757, 'epoch': 0.32}
{'loss': 0.95, 'learning_rate': 0.0002856135401974612, 'epoch': 0.34}
{'loss': 0.9364, 'learning_rate': 0.00028349788434414665, 'epoch': 0.36}
{'loss': 0.9579, 'learning_rate': 0.00028138222849083215, 'epoch': 0.38}
{'loss': 0.9366, 'learning_rate': 0.0002792665726375176, 'epoch': 0.39}
{'eval_loss': 0.9465365409851074, 'eval_runtime': 43.0107, 'eval_samples_per_second': 46.5, 'eval_steps_per_second': 0.744, 'epoch': 0.39}
13%|___________ | 200/1518 [34:55<3:48:14, 10.39s/it]
I was able to fix this issue by rolling back accelerate, peft, bitsandbytes, and transformers to commits dated around 5-6 April, when my previous finetunes were successful. I didn't change any parameters and everything worked.
It's definitely an issue with one of these dependencies; we need to pinpoint which one.
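In case it helps anyone bisect this: I don't have the exact known-good versions handy, but comparing a pip freeze of the broken environment against an older snapshot is a quick way to narrow it down (known-good-requirements.txt below is a hypothetical snapshot file, not something shipped with this repo):

```bash
# Versions of the suspect packages in the current (OOM-ing) environment
pip freeze | grep -iE '^(accelerate|peft|bitsandbytes|transformers)'

# Roll back by reinstalling from a snapshot taken when finetuning still worked,
# e.g. a `pip freeze` saved around 5-6 April (hypothetical file name)
pip install -r known-good-requirements.txt
```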
Thank you, fixing the version of bitsandbytes to 0.37.2 resolved the issue for me. (https://github.com/TimDettmers/bitsandbytes/issues/324)
bitsandbytes==0.37.2
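For anyone skimming the thread, the fix being discussed is just a pip pin; run it in the same environment that finetune.py uses:

```bash
pip install bitsandbytes==0.37.2
pip show bitsandbytes   # confirm "Version: 0.37.2" is what actually got installed
```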
Thanks @SerCeMan. Setting bitsandbytes==0.37.2 works for me, so I closed the issue.
Hey, @flyman3046! It might be worth keeping the issue open so that others who are likely to face the OOM issues can see it.
@SerCeMan SG, re-opened it until the issue from bitsandbytes is fixed.
> Thank you, fixing the version of `bitsandbytes` to 0.37.2 resolved the issue for me. (TimDettmers/bitsandbytes#324)
> bitsandbytes==0.37.2
Yes, I hit an OOM when fine-tuning 13B on 2x 3090 24GB as well. It seems to happen while saving model.state_dict(). I solved it by pinning bitsandbytes==0.37.2 with pip (my bitsandbytes version was 0.38.2 before).
I'm on 0.37.2 and it still occurs.
I can confirm that pinning bitsandbytes to bitsandbytes==0.37.2 does NOT solve the problem.
The problem still happened when I changed bitsandbytes to v0.37.2.
I agree; same problem here even with v0.37.2
I solved my problem by adding these variables to .bashrc:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
Remember to run `source .bashrc` afterwards.
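A quick sanity check that the variables are picked up and actually point at a CUDA install (the /usr/local/cuda paths below match the exports above; adjust if your toolkit lives elsewhere):

```bash
# Confirm the CUDA paths are on PATH / LD_LIBRARY_PATH in the current shell
echo "$PATH" | tr ':' '\n' | grep cuda
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep cuda

# Confirm the CUDA runtime library that bitsandbytes needs is really there
ls /usr/local/cuda/lib64/libcudart.so*
nvcc --version   # should now resolve to /usr/local/cuda/bin/nvcc
```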
Changing to bitsandbytes==0.37.2 fixed the problem for me. I had bitsandbytes==0.39.0 earlier.