DeepSpeed
CUDA error: unknown error
While fine-tuning G-LLaVA on 8 A100s, I hit this problem several times.
The full trace is here: https://drive.google.com/file/d/195PO96uWKnx4LE3BWjxm0DsrQxWbj3QP/view?usp=sharing
The scripts are here: https://github.com/pipilurj/G-LLaVA/blob/main/scripts
The first stage fine-tuned without issues using run_alignment.sh, but fine-tuning the second stage with run_qa.sh triggers the error above. Now when I run "nvidia-smi" in the terminal, it shows "Unable to determine the device handle for GPU 0000:4F:00.0: Unknown Error". Can anyone please help me solve this? Thank you!
@liuhui0401 - this seems like a CUDA error, or the GPUs being in a bad state. If you power cycle the machine, does nvidia-smi work?
Yes. But if I fine-tune again, I hit the same problem again. I don't know the reason.
I see. Which CUDA version are you currently using, and can you try a newer version as well?
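For anyone answering the version question above, here is a minimal sketch of one way to collect the relevant versions in a single pass. It assumes `nvidia-smi` and `nvcc` may or may not be on the PATH and that PyTorch may or may not be importable; the helper names `run_query` and `collect_versions` are made up for illustration and are not part of DeepSpeed or G-LLaVA.

```python
import shutil
import subprocess


def run_query(cmd):
    """Run a command; return its stripped stdout, or None if it is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        # nvidia-smi exits non-zero when the driver cannot reach a GPU
        return None
    return result.stdout.strip()


def collect_versions():
    """Gather driver, toolkit, and PyTorch CUDA versions (hypothetical helper)."""
    info = {
        # One driver version line is printed per visible GPU
        "driver": run_query(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
        ),
        # Version banner of the CUDA toolkit compiler, if installed
        "nvcc": run_query(["nvcc", "--version"]),
        "torch": None,
        "torch_cuda": None,
    }
    try:
        import torch  # optional; skipped if not installed or not loadable

        info["torch"] = torch.__version__
        # CUDA version PyTorch was built against (None for CPU-only builds)
        info["torch_cuda"] = torch.version.cuda
    except (ImportError, OSError):
        pass
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

A `None` for `driver` on a machine that does have GPUs is itself worth reporting, since it suggests the driver cannot currently talk to the devices.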
Closing as stale. If you are still hitting this, please comment and we can re-open.