
Encountered "RuntimeError: CUDA error: device-side assert triggered" when reproducing the finetune with scripts/finetune_guanaco_7b.sh

Open JustinZou1 opened this issue 1 year ago • 3 comments

Failed to finetune with finetune_guanaco_7b:

  File "/home/ubuntu/anaconda3/envs/qlora/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 465, in _prepare_decoder_attention_mask
    combined_attention_mask = _make_causal_mask(
  File "/home/ubuntu/anaconda3/envs/qlora/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 49, in _make_causal_mask
    mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min, device=device), device=device)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-

Here are the screenshots: 1685870412062, 1685870434196

My CUDA version is 11.8, and I am using an Nvidia A10 with 24 GB of GPU memory to finetune this.

JustinZou1 Jun 04 '23 09:06
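The error message above suggests passing CUDA_LAUNCH_BLOCKING=1 so that the failing kernel is reported at the call that actually launched it. A minimal sketch of doing that from Python follows. The helper names `blocking_env` and `rerun` are hypothetical, and the sketch assumes the script is launched with bash from the repository root:

```python
import os
import subprocess


def blocking_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # Variable name taken from the error message in the traceback.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun(script="scripts/finetune_guanaco_7b.sh"):
    """Rerun the finetune script with the debugging variable set.

    Assumption: the script is run with bash from the repository root.
    """
    return subprocess.run(["bash", script], env=blocking_env(), check=False)


# Usage (not executed here, since it starts a training run):
# result = rerun()
# print(result.returncode)
```

With synchronous launches the run is slower, but the stack trace should point at the operation that triggered the assert rather than at a later API call.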

I reproduced the 7B training on an Nvidia A10 on AWS a couple of days ago without any errors. I was using the AWS-supplied Ubuntu 20.04 PyTorch 2.0.0 AMI.

jwnsu Jun 04 '23 16:06

@JustinZou1 I was getting the same error with decapoda-research/llama-7b-hf, but it went away when I switched to huggyllama/llama-7b.

ag1988 Jun 08 '23 00:06
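The thread does not say why switching checkpoints helps. In general, one common cause of a device-side assert is an index that falls outside an embedding table, for example a token id that is greater than or equal to the number of embedding rows. Whether that is the cause here is an assumption. The sketch below shows a CPU-side check. The function name `find_out_of_range_ids` and the example values are hypothetical:

```python
def find_out_of_range_ids(token_ids, num_embeddings):
    """Return (position, token_id) pairs that an embedding lookup would reject."""
    return [
        (pos, tid)
        for pos, tid in enumerate(token_ids)
        if not 0 <= tid < num_embeddings
    ]


# Illustrative values only: a table with 32000 rows accepts ids 0..31999,
# so an id of 32000 (e.g. an added padding token) is out of range.
example_ids = [1, 15043, 32000]
print(find_out_of_range_ids(example_ids, num_embeddings=32000))
# prints [(2, 32000)]
```

Running a check like this on a tokenized batch before moving it to the GPU gives a readable message instead of an asynchronous CUDA assert.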

@ag1988 I tried huggyllama/llama-7b as well, and it works. Thanks for your help.

JustinZou1 Jun 10 '23 15:06