Exception: CUDA error: an illegal memory access was encountered. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

vhiwase opened this issue 1 year ago

I attempted to serve the original Llama 3.1 base model (the bnb-4bit checkpoint), both with and without load_in_4bit enabled. Below are my observations.

When load_in_4bit = True: The model throws the following error:

Exception: CUDA error: an illegal memory access was encountered. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

However, the error does not occur immediately; it appears only after the model has processed some initial data. In this configuration the model uses about 8 GB of GPU memory.

Code:

from unsloth import FastLanguageModel

max_seq_length = 4200
dtype = None  # Auto detection; Float16 for Tesla T4, V100; Bfloat16 for Ampere+
load_in_4bit = True  # Use 4-bit quantization to reduce memory usage.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    # token="hf_..."  # Required for gated models like Meta-Llama/Llama-2-7b-hf
)

When load_in_4bit = False: The model runs without errors and uses around 16 GB of memory.

Code:

from unsloth import FastLanguageModel

max_seq_length = 4200
dtype = None  # Auto detection; Float16 for Tesla T4, V100; Bfloat16 for Ampere+
load_in_4bit = False  # Disable 4-bit quantization.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    # token="hf_..."  # Required for gated models like Meta-Llama/Llama-2-7b-hf
)

Based on these findings, it seems that if we trained with load_in_4bit = True, the same issue would persist in our fine-tuned model, as it is inherent to the base model.

I therefore recommend retraining this model for the load_in_4bit = True case.
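
For reference, a minimal sketch of the kind of inference call being exercised, continuing from the load above (this uses the standard Unsloth inference helpers; the prompt is only illustrative):

# Sketch: run one generation with the already-loaded model and tokenizer.
FastLanguageModel.for_inference(model)  # enable Unsloth's fast inference path

inputs = tokenizer(
    "Summarize the following document:\n...",
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))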

vhiwase avatar Sep 24 '24 11:09 vhiwase

@vhiwase Apologies for the delay! Would you happen to know what dataset you were using? It's possible there are some weird out-of-bounds tokens causing errors.
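
One quick way to check that theory (a rough sketch; `dataset` and its "text" field are placeholders for however your data is actually stored) is to compare the tokenized ids against the size of the model's embedding table:

# Sanity check for out-of-range token ids (sketch).
vocab_size = model.get_input_embeddings().weight.shape[0]

bad_rows = []
for i, example in enumerate(dataset):
    ids = tokenizer(example["text"])["input_ids"]
    if max(ids) >= vocab_size or min(ids) < 0:
        bad_rows.append(i)

print(f"Embedding rows: {vocab_size}, samples with out-of-range ids: {len(bad_rows)}")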

danielhanchen avatar Oct 01 '24 08:10 danielhanchen

@danielhanchen Apologies for the delay in responding. I'm currently testing the model with results obtained from OCR processing using Azure Document Intelligence. The inputs consist of random chunks of text extracted from various documents.

vhiwase avatar Oct 14 '24 08:10 vhiwase

@vhiwase No worries! Does this happen on other machines? Like in a Colab?

danielhanchen avatar Oct 18 '24 08:10 danielhanchen

@danielhanchen You are correct that we trained the model on Amazon EC2 G6 instances, and inference works fine there. However, we hosted the model inference on a different machine, specifically Amazon EC2 G6e instances, where the error occurs. Could this be related to the dtype setting? (A sketch for pinning the dtype explicitly follows the note below.)

dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+.

Note:

G6 Instances: Feature up to 8 NVIDIA L4 Tensor Core GPUs with 24 GB of memory per GPU, and third generation AMD EPYC processors.

G6e Instances: Feature up to 8 NVIDIA L40S Tensor Core GPUs with 384 GB of total GPU memory (48 GB per GPU), and third generation AMD EPYC processors.
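
If dtype autodetection is the suspect, one way to rule it out is to pin the dtype explicitly on both machines. A minimal sketch (both the L4 and the L40S support bfloat16, so this should resolve to bfloat16 on either instance type):

# Sketch: select the dtype explicitly instead of passing dtype=None.
import torch
from unsloth import FastLanguageModel

dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=4200,
    dtype=dtype,
    load_in_4bit=True,
)
print(f"Loaded with dtype={dtype} on {torch.cuda.get_device_name(0)}")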

vhiwase avatar Oct 24 '24 03:10 vhiwase

You can reproduce the same error using Unsloth's demo Qwen 2.5 notebook (Qwen 2.5 + Unsloth 2x faster finetuning.ipynb):

  • Tested on Google Colab with a T4 backend
  • Set load_in_4bit = False

A RuntimeError is thrown during training:

trainer_stats = trainer.train()

RuntimeError                              Traceback (most recent call last)
<ipython-input-7-3d62c575fcfd> in <cell line: 0>()
----> 1 trainer_stats = trainer.train()

36 frames
/usr/local/lib/python3.11/dist-packages/unsloth/kernels/utils.py in matmul_lora(X, W, W_quant, A, B, s, out)
    483         reshape = False
    484     pass
--> 485     out = torch_matmul(X, W, out = out)
    486     if W_quant is not None: del W
    487 

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
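
Since the asynchronous reporting can point at the wrong call, one way to get a more precise stack trace is to force synchronous kernel launches before rerunning. A sketch (the environment variable must be set before CUDA is initialized, and TORCH_USE_CUDA_DSA additionally requires a PyTorch build compiled with it):

# Put this at the very top of the notebook, before torch touches the GPU.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
from unsloth import FastLanguageModel  # then reload the model and rerun trainer.train()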

EdGaere avatar Mar 14 '25 07:03 EdGaere

@EdGaere You can't load a 7B model in 16-bit on a T4; you will run out of VRAM. A T4 only has about 15 GB of usable VRAM, which is nowhere near sufficient. Also note that this error is sometimes returned when you run multiple parallel threads on a GPU, so it isn't always the root cause of the issue you're having.
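
As a rough back-of-envelope check (weights only, ignoring activations, gradients, optimizer state, and the KV cache, which add several more GB during training):

# Sketch: approximate weight memory for a 7B-parameter model.
params = 7e9
print(f"fp16 weights:  ~{params * 2 / 1024**3:.1f} GiB")    # ~13.0 GiB
print(f"4-bit weights: ~{params * 0.5 / 1024**3:.1f} GiB")  # ~3.3 GiB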

If you rerun the notebook you linked to, making sure that the install cell is as follows:

%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton transformers cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth

This ensures the latest unsloth_zoo and unsloth PyPI releases are properly installed.
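
A quick way to confirm what actually got installed (a sketch using importlib.metadata; the package names match the install cell above):

# Sketch: print the installed versions before re-testing.
from importlib.metadata import version

for pkg in ("unsloth", "unsloth_zoo", "bitsandbytes", "transformers"):
    print(pkg, version(pkg))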

Then you'll see a more meaningful runtime error:

[Screenshot of the runtime error]

I will close this thread for now. If you're still experiencing the same issues with the latest unsloth versions, feel free to comment.

rolandtannous avatar Jun 06 '25 20:06 rolandtannous