OOM when finetuning unsloth/llama-3-8b-bnb-4bit on a Colab T4 with an 18,000-token context length
I'm using the Unsloth Colab notebook to finetune the unsloth/llama-3-8b-bnb-4bit model on data with a max context length of 18,000 tokens. Whenever I kick off training, it runs out of memory. That doesn't seem to happen with the yahma/alpaca example. Here's the error:
```
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 102 | Num Epochs = 5
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040
---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
<ipython-input-7-3d62c575fcfd> in <cell line: 1>()
----> 1 trainer_stats = trainer.train()

13 frames
/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py in _convert_to_fp32(tensor)
    779
    780 def _convert_to_fp32(tensor):
--> 781     return tensor.float()
    782
    783 def _is_fp16_bf16_tensor(tensor):

OutOfMemoryError: CUDA out of memory. Tried to allocate 9.47 GiB. GPU 0 has a total capacity of 14.75 GiB of which 3.78 GiB is free. Process 2116 has 10.95 GiB memory in use. Of the allocated memory 10.79 GiB is allocated by PyTorch, and 23.53 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
Is the longer context length the reason it runs out of memory? What's the recommended way to make this fine-tuning job possible?
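For reference, the training cell follows the stock notebook pattern, roughly like this (the model name, 18,000-token limit, batch size, accumulation steps, and step count are from the post and log above; the remaining arguments are the usual notebook defaults, `dataset` is a placeholder for my formatted data, and the exact `SFTTrainer` signature depends on the trl version):

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=18000,   # data goes up to ~18K tokens
    load_in_4bit=True,
)

# LoRA adapters on all attention/MLP projections (~41.9M trainable params, as in the log).
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,        # placeholder: formatted training set
    dataset_text_field="text",
    max_seq_length=18000,
    args=TrainingArguments(
        per_device_train_batch_size=2,   # matches the log above
        gradient_accumulation_steps=4,   # matches the log above
        max_steps=60,                    # matches the log above
        fp16=True,
        output_dir="outputs",
    ),
)

trainer_stats = trainer.train()   # OOMs here
```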
Yes, contexts that long will cause OOMs. According to our blog post (https://unsloth.ai/blog/llama3), the max context length on a Tesla T4 (16 GB) is roughly 10K tokens.
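A minimal sketch of the adjustment, assuming the standard notebook API: cap max_seq_length at roughly 10K, keep 4-bit loading plus Unsloth gradient checkpointing, and optionally set the allocator flag the OOM message itself suggests. Note that samples longer than the cap get truncated, so data that must keep its full 18K context needs a GPU with more memory.

```python
import os

# Optional: reduce fragmentation, as suggested in the OOM message above.
# Set this before torch/unsloth initialize CUDA.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

from unsloth import FastLanguageModel

max_seq_length = 10240   # ~10K tokens, the rough T4 limit from the blog post

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=max_seq_length,
    load_in_4bit=True,            # 4-bit base weights (QLoRA-style)
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",   # offloads activations to save VRAM
)
```

Dropping per_device_train_batch_size to 1 (and doubling gradient_accumulation_steps to keep the same effective batch size) also helps at long sequence lengths.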