CUDA OOM when fine-tuning 13B
I was able to fine-tune the 7B model on a single A100-40G GPU but ran into OOM when fine-tuning 13B.
Here is the error message:
│ /opt/conda/lib/python3.7/site-packages/bitsandbytes/nn/modules.py:268 in _save_to_state_dict │
│ │
│ 265 │ │ │
│ 266 │ │ try: │
│ 267 │ │ │ if reorder_layout: │
│ ❱ 268 │ │ │ │ self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices) │
│ 269 │ │ │ │
│ 270 │ │ │ super()._save_to_state_dict(destination, prefix, keep_vars) │
│ 271 │
│ │
│ /opt/conda/lib/python3.7/site-packages/bitsandbytes/autograd/_functions.py:100 in undo_layout │
│ │
│ 97 │ outputs[tile_indices.flatten()] = tensor │
│ 98 │ outputs = outputs.reshape(tile_rows, tile_cols, cols // tile_cols, rows // tile_rows │
│ 99 │ outputs = outputs.permute(3, 0, 2, 1) # (rows // tile_rows, tile_rows), (cols // ti │
│ ❱ 100 │ return outputs.reshape(rows, cols).contiguous() │
│ 101 │
│ 102 │
│ 103 class MatMul8bit(torch.autograd.Function): │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryError: CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 39.41 GiB total capacity; 35.83 GiB already allocated; 34.50 MiB free; 38.17 GiB reserved in
total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and
PYTORCH_CUDA_ALLOC_CONF
My command is:
python finetune.py \
--base_model='decapoda-research/llama-13b-hf' \
--num_epochs=5 \
--cutoff_len=512 \
--group_by_length \
--output_dir='./alpaca-lora-saved-model-13b' \
--lora_target_modules='[q_proj,k_proj,v_proj,o_proj]' \
--lora_r=16 \
--micro_batch_size=8
I tried a couple of times, e.g., with a smaller cutoff_len, but still got the same OOM error. One thing I noticed is that the issue happens after training ~10% of the steps. Any thoughts or help would be greatly appreciated.
Running into the same issue: getting OOM after 7-10% while running on 4x A100-40GB.
I started at --micro_batch_size=24 and have been reducing it down to 8, and it still OOMs at around 10%.
Running with
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 finetune.py \
--base_model="yahma/llama-13b-hf" \
--num_epochs=5 \
--cutoff_len=512 \
--data_path="dataset1.json" \
--output_dir='./alpaca-lora-saved-model-13b' \
--lora_target_modules='[q_proj,k_proj,v_proj,o_proj]' \
--lora_r=16 \
--micro_batch_size=8
Any ideas?
Error is
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 39.44 GiB total capacity; 36.06 GiB already allocated; 19.88 MiB free; 37.97 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management
Should I play around with max_split_size_mb, or should I look in another direction?
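(For reference, my understanding is that max_split_size_mb is passed through the PYTORCH_CUDA_ALLOC_CONF environment variable before launching the script; 128 below is just an arbitrary value to try, not a recommendation.)

```bash
# Ask PyTorch's caching allocator to split large cached blocks, which can reduce
# fragmentation. Must be set in the environment before the training process starts.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

# ...then launch finetune.py exactly as before, e.g.
# CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 finetune.py ...
```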
Dataset is around 60% larger than the latest alpaca_cleaned.
Another thing I noticed is that the ETA for micro_batch_size=24 is almost the same as for micro_batch_size=8.
I retrained 7B without any issues. For 13B, I tried a couple of things, but to no avail:
- Use a smaller `cutoff_len = 256`
- Use a smaller `batch_size = 64`
Here is the wandb link to a job: https://wandb.ai/zhengfei-hit/huggingface/runs/6mt3ilc0/overview?workspace=user-zhengfei-hit.
The job used only 45% GPU memory before OOM.
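A simple way to check whether memory actually spikes right before the OOM (rather than creeping up during normal steps) is to poll nvidia-smi in a second shell while the job runs; these are standard nvidia-smi flags, nothing specific to this repo:

```bash
# Print per-GPU memory usage every 5 seconds while training runs elsewhere
nvidia-smi --query-gpu=timestamp,index,memory.used,memory.total --format=csv -l 5
```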
I tried setting max_split_size_mb to 128 MB and 64 MB. It still didn't help; it errors out at 10%, when I think it is checkpointing or something.
Yes
> I retrained 7b without any issues. For 13B, I tried a couple of things but to no avail:
> 1. Use a smaller `cutoff_len = 256`
> 2. Use a smaller `batch_size = 64`
> Here is the wandb link to a job: https://wandb.ai/zhengfei-hit/huggingface/runs/6mt3ilc0/overview?workspace=user-zhengfei-hit.
> The job used only 45% GPU memory before OOM.
It errored out at 10% after doing a checkpoint? (I think)
It usually errors out when it reaches 200 iterations. @tloen What do you think?
I rented 8x RTX 3090s and am getting the same issue there. At 10%, or 200 iterations, it errors out.
Always at 200 iterations...
{'loss': 1.5662, 'learning_rate': 2.3999999999999997e-05, 'epoch': 0.02}
{'loss': 1.521, 'learning_rate': 5.399999999999999e-05, 'epoch': 0.04}
{'loss': 1.3948, 'learning_rate': 8.4e-05, 'epoch': 0.06}
{'loss': 1.1799, 'learning_rate': 0.00011399999999999999, 'epoch': 0.08}
{'loss': 1.079, 'learning_rate': 0.00014399999999999998, 'epoch': 0.1}
{'loss': 1.0344, 'learning_rate': 0.00017399999999999997, 'epoch': 0.12}
{'loss': 1.0017, 'learning_rate': 0.000204, 'epoch': 0.14}
{'loss': 0.9883, 'learning_rate': 0.000234, 'epoch': 0.16}
{'loss': 0.9856, 'learning_rate': 0.00026399999999999997, 'epoch': 0.18}
{'loss': 0.968, 'learning_rate': 0.000294, 'epoch': 0.2}
{'loss': 0.9682, 'learning_rate': 0.00029830747531734835, 'epoch': 0.22}
{'loss': 0.965, 'learning_rate': 0.0002961918194640338, 'epoch': 0.24}
{'loss': 0.9425, 'learning_rate': 0.0002940761636107193, 'epoch': 0.26}
{'loss': 0.9679, 'learning_rate': 0.00029196050775740477, 'epoch': 0.28}
{'loss': 0.9681, 'learning_rate': 0.0002898448519040903, 'epoch': 0.3}
{'loss': 0.9561, 'learning_rate': 0.0002877291960507757, 'epoch': 0.32}
{'loss': 0.95, 'learning_rate': 0.0002856135401974612, 'epoch': 0.34}
{'loss': 0.9364, 'learning_rate': 0.00028349788434414665, 'epoch': 0.36}
{'loss': 0.9579, 'learning_rate': 0.00028138222849083215, 'epoch': 0.38}
{'loss': 0.9366, 'learning_rate': 0.0002792665726375176, 'epoch': 0.39}
{'eval_loss': 0.9465365409851074, 'eval_runtime': 43.0107, 'eval_samples_per_second': 46.5, 'eval_steps_per_second': 0.744, 'epoch': 0.39}
13%|___________ | 200/1518 [34:55<3:48:14, 10.39s/it]
I was able to fix this issue by rolling back accelerate, peft, bitsandbytes, and transformers to commits dated around 5-6 April, when my previous finetunes were successful. I didn't change any parameters and everything worked.
It's definitely an issue with one of these dependencies; we need to pinpoint which one.
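In case it helps anyone bisect this: I don't have the exact known-good versions handy, but comparing a pip freeze of the broken environment against an older snapshot is a quick way to narrow it down (known-good-requirements.txt below is a hypothetical snapshot file, not something shipped with this repo):

```bash
# Versions of the suspect packages in the current (OOM-ing) environment
pip freeze | grep -iE '^(accelerate|peft|bitsandbytes|transformers)'

# Roll back by reinstalling from a snapshot taken when finetuning still worked,
# e.g. a `pip freeze` saved around 5-6 April (hypothetical file name)
pip install -r known-good-requirements.txt
```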
Thank you, fixing the version of bitsandbytes to 0.37.2 resolved the issue for me. (https://github.com/TimDettmers/bitsandbytes/issues/324)
bitsandbytes==0.37.2
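For anyone skimming the thread, the fix being discussed is just a pip pin; run it in the same environment that finetune.py uses:

```bash
pip install bitsandbytes==0.37.2
pip show bitsandbytes   # confirm "Version: 0.37.2" is what actually got installed
```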
Thanks @SerCeMan. Setting bitsandbytes==0.37.2 works for me, so I closed the issue.
Hey, @flyman3046! It might be worth keeping the issue open so that others who are likely to face the OOM issues can see it.
@SerCeMan SG, re-opened it until the issue from bitsandbytes is fixed.
> Thank you, fixing the version of `bitsandbytes` to 0.37.2 resolved the issue for me. (TimDettmers/bitsandbytes#324)
> bitsandbytes==0.37.2
Yes, I hit an OOM when fine-tuning 13B on 2x 3090 24GB as well. It seems to happen while saving model.state_dict(). I solved it by pinning bitsandbytes==0.37.2 with pip (my bitsandbytes version was 0.38.2 before).
I'm on 0.37.2 and it still occurs.
I can confirm that pinning bitsandbytes to bitsandbytes==0.37.2 does NOT solve the problem.
The problem still happened when I changed bitsandbytes to v0.37.2.
I agree; same problem here even with v0.37.2
I solved my problem by adding these variables to .bashrc:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
Remember to run `source .bashrc` afterwards.
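A quick sanity check that the variables are picked up and actually point at a CUDA install (the /usr/local/cuda paths below match the exports above; adjust if your toolkit lives elsewhere):

```bash
# Confirm the CUDA paths are on PATH / LD_LIBRARY_PATH in the current shell
echo "$PATH" | tr ':' '\n' | grep cuda
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep cuda

# Confirm the CUDA runtime library that bitsandbytes needs is really there
ls /usr/local/cuda/lib64/libcudart.so*
nvcc --version   # should now resolve to /usr/local/cuda/bin/nvcc
```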
Changing to bitsandbytes==0.37.2 fixed the problem for me. I had bitsandbytes==0.39.0 earlier.