Exception: CUDA error: an illegal memory access was encountered. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
I attempted to serve the original base model of Llama 3.1 in 4-bit, both with and without setting load_in_4bit. Below are my observations.
When load_in_4bit = True:
The model throws the following error:
Exception: CUDA error: an illegal memory access was encountered. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
However, the error does not occur immediately; it appears only after the model has processed some initial data. The model also consumes 8 GB of memory.
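For what it's worth, the cleanest way I found to get a trustworthy stack trace is to force synchronous kernel launches before anything CUDA-related is imported. A minimal sketch (the environment variable must be set before torch/unsloth are imported; then load the model and repeat the failing request exactly as in the code below):

import os
# Force synchronous CUDA kernel launches so the error is reported at the call that
# actually faults, instead of at a later, unrelated API call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
from unsloth import FastLanguageModel
# ...load the model and repeat the failing request as shown below...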
Code:
from unsloth import FastLanguageModel

max_seq_length = 4200
dtype = None  # Auto-detect: Float16 for Tesla T4/V100, Bfloat16 for Ampere+
load_in_4bit = True  # Use 4-bit quantization to reduce memory usage.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    # token="hf_...",  # Required for gated models like Meta-Llama/Llama-2-7b-hf
)
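For context, inference after loading looks roughly like this (a sketch; the prompt and generation settings are placeholders, not the actual OCR inputs):

import torch

# Continuing from the loading snippet above.
FastLanguageModel.for_inference(model)  # enable Unsloth's faster inference path

# Placeholder prompt standing in for one of the OCR text chunks.
inputs = tokenizer("Summarize the following document:\n...", return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))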
When load_in_4bit = False:
The model runs without errors and uses around 16 GB of memory.
Code:
from unsloth import FastLanguageModel

max_seq_length = 4200
dtype = None  # Auto-detect: Float16 for Tesla T4/V100, Bfloat16 for Ampere+
load_in_4bit = False  # Disable 4-bit quantization.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    # token="hf_...",  # Required for gated models like Meta-Llama/Llama-2-7b-hf
)
Based on these findings, it seems that if we trained with load_in_4bit = True, the same issue would persist in our fine-tuned model, as it is inherent to the base model.
I therefore recommend that we train this model again with load_in_4bit = True.
@vhiwase Apologies for the delay! Would you happen to know which dataset you were using? It's possible there are some weird out-of-bounds tokens causing errors.
@danielhanchen Apologies for the delay in responding. I'm currently testing the model with results obtained from OCR processing using Azure Document Intelligence. The inputs consist of random chunks of text extracted from various documents.
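In case it helps with the out-of-bounds-token theory, this is the kind of check I can run over those chunks (a sketch; ocr_chunks is a placeholder name for the list of extracted text strings):

# Sanity-check the tokenized OCR chunks: no token id should fall outside the
# tokenizer's vocabulary, and no chunk should exceed max_seq_length.
vocab_size = len(tokenizer)

for i, chunk in enumerate(ocr_chunks):  # ocr_chunks: placeholder list of OCR strings
    ids = tokenizer(chunk)["input_ids"]
    if ids and (max(ids) >= vocab_size or min(ids) < 0):
        print(f"chunk {i}: token id out of range (max {max(ids)}, vocab size {vocab_size})")
    if len(ids) > max_seq_length:
        print(f"chunk {i}: {len(ids)} tokens > max_seq_length={max_seq_length}")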
@vhiwase No worries! Does this happen on other machines? Like in a Colab?
@danielhanchen You are correct that we trained the model on Amazon EC2 G6 Instances, and inference is working fine there. However, we hosted the model inference on a different machine—specifically, Amazon EC2 G6e Instances. Could this be related to the dtype setting?
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+.
Note:
G6 Instances: Feature up to 8 NVIDIA L4 Tensor Core GPUs with 24 GB of memory per GPU, and third generation AMD EPYC processors.
G6e Instances: Feature up to 8 NVIDIA L40S Tensor Core GPUs with 384 GB of total GPU memory (48 GB per GPU), and third generation AMD EPYC processors.
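To rule dtype out, a quick check of what dtype = None resolves to on each machine might help (a sketch; both the L4 and the L40S are Ada-generation GPUs, so I'd expect bfloat16 on both, but it's worth confirming on the actual instances):

import torch

# Print what the automatic dtype detection would most likely pick on this machine.
major, minor = torch.cuda.get_device_capability(0)
print("GPU:", torch.cuda.get_device_name(0))
print("Compute capability:", f"{major}.{minor}")
print("bfloat16 supported:", torch.cuda.is_bf16_supported())
print("Expected auto dtype:", "bfloat16" if torch.cuda.is_bf16_supported() else "float16")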
You can reproduce the same error using Unsloth's demo Qwen 2.5 notebook (Qwen 2.5 + Unsloth 2x faster finetuning.ipynb):
- Tested on Google Colab with T4 backend
- Set load_in_4bit to False
load_in_4bit = False
A RuntimeError exception is thrown during training
trainer_stats = trainer.train()
RuntimeError Traceback (most recent call last)
<ipython-input-7-3d62c575fcfd> in <cell line: 0>()
----> 1 trainer_stats = trainer.train()
36 frames
/usr/local/lib/python3.11/dist-packages/unsloth/kernels/utils.py in matmul_lora(X, W, W_quant, A, B, s, out)
483 reshape = False
484 pass
--> 485 out = torch_matmul(X, W, out = out)
486 if W_quant is not None: del W
487
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
@EdGaere You can't load a 7B model in 16-bit on a T4; you will run out of VRAM. A T4 only has about 15 GB of VRAM, which is nowhere near sufficient. Also note that the error sometimes returned when you run multiple parallel threads on a GPU isn't always the root cause of the issue you're having.
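Rough arithmetic for why (a sketch; the figures are approximate weight memory only and ignore activations and the KV cache):

import torch

params = 7e9                     # ~7B parameters (approximate)
fp16_gb = params * 2 / 1e9       # 2 bytes/param in float16  -> ~14 GB of weights alone
int4_gb = params * 0.5 / 1e9     # ~0.5 bytes/param in 4-bit -> ~3.5 GB of weights

total_vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"fp16 weights ~{fp16_gb:.1f} GB, 4-bit weights ~{int4_gb:.1f} GB, "
      f"this GPU has {total_vram_gb:.1f} GB of VRAM")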
If you rerun the notebook you linked to, while making sure that the install cell is as follows:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton transformers cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth
This will get the latest unsloth-zoo and unsloth PyPI versions properly installed.
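After reinstalling (and restarting the runtime), you can confirm which versions actually ended up in the environment with something like:

# Check which package versions are installed in the Colab runtime.
!pip show unsloth unsloth_zoo | grep -E "^(Name|Version)"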
You'll then see a more meaningful runtime error.
I will close this thread for now. If you're still experiencing the same issues with the latest unsloth versions, feel free to comment.