qlora
Encountered "RuntimeError: CUDA error: device-side assert triggered" when reproducing the finetune with scripts/finetune_guanaco_7b.sh
Failed to finetune with finetune_guanaco_7b:
File "/home/ubuntu/anaconda3/envs/qlora/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 465, in _prepare_decoder_attention_mask
combined_attention_mask = _make_causal_mask(
File "/home/ubuntu/anaconda3/envs/qlora/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 49, in _make_causal_mask
mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min, device=device), device=device)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-
Here is the screenshot:
My CUDA version is 11.8, and I am finetuning on an Nvidia A10 with 24 GB of GPU memory.
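The error message above suggests `CUDA_LAUNCH_BLOCKING=1` because CUDA kernels run asynchronously, so the reported stack frame may not be where the assert actually fired. A minimal sketch of setting it from Python follows; the helper name `enable_blocking_launches` is hypothetical, and the variable is assumed to be set before the process makes its first CUDA call.

```python
import os


def enable_blocking_launches() -> None:
    """Make CUDA kernel launches synchronous so tracebacks point at the failing call.

    Hypothetical helper: call it at the very top of the training script,
    before any CUDA work is done.
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


enable_blocking_launches()
# Import torch and build the model only after this point.
```

Exporting the variable in the shell before launching the finetune script should have the same effect. Training runs slower with blocking launches, so this is only meant for debugging.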
I reproduced the 7B training on an Nvidia A10 at AWS a couple of days ago without any error. I was using the AWS-supplied Ubuntu 20.04 PyTorch 2.0.0 AMI.
@JustinZou1 I was getting the same error with decapoda-research/llama-7b-hf, but the error went away using huggyllama/llama-7b.
@ag1988 I also tried huggyllama/llama-7b, and it works. Thanks for your help.
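The thread does not confirm why one checkpoint fails and the other works. A common cause of device-side asserts in general is a token id that falls outside the embedding table, which can happen when a tokenizer's special-token ids do not match the model. Assuming that is worth ruling out here, a minimal sketch of the check is below; `find_out_of_range_ids` and the sample ids are hypothetical and only for illustration.

```python
from typing import Iterable, List, Tuple


def find_out_of_range_ids(
    input_ids: Iterable[int], vocab_size: int
) -> List[Tuple[int, int]]:
    """Return (position, token_id) pairs that a table with vocab_size rows cannot index."""
    return [
        (pos, tok)
        for pos, tok in enumerate(input_ids)
        if tok < 0 or tok >= vocab_size
    ]


# Illustrative values only: the last id is one past the end of a 32000-row table.
sample_ids = [1, 15043, 3186, 32000]
bad = find_out_of_range_ids(sample_ids, vocab_size=32000)
print(bad)
```

In a real run, the ids would come from the tokenizer output for a training batch and the table size from the model's input embedding layer. Running the same batch on CPU is another way to surface an indexing problem as an ordinary Python exception.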