
CUDA error: unknown error

Open liuhui0401 opened this issue 1 year ago • 3 comments

While fine-tuning G-LLaVA on 8 A100s, I hit this error several times.

The full trace is here: https://drive.google.com/file/d/195PO96uWKnx4LE3BWjxm0DsrQxWbj3QP/view?usp=sharing

The scripts are here: https://github.com/pipilurj/G-LLaVA/blob/main/scripts

The first stage, fine-tuned with run_alignment.sh, worked well. But when I fine-tuned the second stage with run_qa.sh, I hit the above error. Now when I run "nvidia-smi" in the terminal, it shows "Unable to determine the device handle for GPU 0000:4F:00.0: Unknown Error". Can anyone please help me solve this problem? Thank you!

liuhui0401 avatar Apr 30 '24 04:04 liuhui0401

@liuhui0401 - this seems like a CUDA error, or the GPUs are in a bad state. If you power cycle the machine, does nvidia-smi work?
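
One way to make that check repeatable, for example right after a reboot and again after a failed run, is a small script along these lines. This is only a sketch: the helper name `check_nvidia_smi` is hypothetical, and it assumes that a non-zero exit code or the "Unable to determine the device handle" message from the report above means the driver cannot reach a GPU.

```python
import shutil
import subprocess
from typing import Tuple


def check_nvidia_smi(timeout_s: float = 30.0) -> Tuple[bool, str]:
    """Run `nvidia-smi -L` and report whether the driver can list the GPUs.

    Hypothetical helper for illustration; returns (healthy, detail).
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False, "nvidia-smi not found on PATH"
    try:
        proc = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # A hung driver can make nvidia-smi block, so treat a timeout as unhealthy.
        return False, f"nvidia-smi did not return within {timeout_s} s"
    except OSError as exc:
        return False, f"could not run nvidia-smi: {exc}"
    output = (proc.stdout + proc.stderr).strip()
    # Assumption: either a non-zero exit code or the error text from this
    # issue indicates that at least one GPU is unreachable.
    healthy = (
        proc.returncode == 0
        and "Unable to determine the device handle" not in output
    )
    return healthy, output


if __name__ == "__main__":
    ok, detail = check_nvidia_smi()
    print("healthy" if ok else "unhealthy")
    print(detail)
```

If it reports healthy after the power cycle and unhealthy after the run_qa.sh stage fails, that would suggest the GPUs are being knocked into a bad state during training rather than staying broken.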

loadams avatar Apr 30 '24 21:04 loadams

> @liuhui0401 - this seems like a CUDA error, or the GPUs are in a bad state. If you power cycle the machine, does nvidia-smi work?

Yes. But if I fine-tune again, I hit the same problem again. I don't know the reason.

liuhui0401 avatar May 01 '24 01:05 liuhui0401

I see. What CUDA version are you currently using, and can you try a newer version as well?

loadams avatar May 06 '24 16:05 loadams

Closing as stale. If you are still hitting this, please comment and we can re-open.

loadams avatar Aug 14 '24 21:08 loadams