
Expected a cuda device, but got: cpu

mvuthegoat opened this issue 2 years ago · 4 comments

When I try to resume training from a checkpoint, it fails with `ValueError: Expected a cuda device, but got: cpu`. How do I fix this?

[Screenshot: traceback ending in `ValueError: Expected a cuda device, but got: cpu`]

mvuthegoat, Jun 29 '23
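(Editor's note, not from the thread: this error commonly means a tensor or RNG state from the checkpoint was materialized on the CPU while the trainer expected it on the GPU. A minimal sketch of a first check, assuming plain PyTorch and a hypothetical checkpoint path:)

```python
# Minimal sketch (not the repo's code): confirm CUDA is visible, then load
# the checkpoint directly onto the GPU instead of the default CPU placement.
import torch

assert torch.cuda.is_available(), "resuming requires a visible CUDA device"

# "output/checkpoint-200" is a hypothetical Trainer checkpoint directory.
state = torch.load(
    "output/checkpoint-200/pytorch_model.bin",
    map_location="cuda:0",  # keep checkpoint tensors off the CPU
)
```

Even if the script resumes through Hugging Face's `Trainer(resume_from_checkpoint=...)` rather than a manual `torch.load`, verifying `torch.cuda.is_available()` is still the first thing to rule out.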

Please check the versions: `flash_attn==1.0.5`, `bitsandbytes==0.37.2`, `fschat==0.2.10`, and check PyTorch:

```python
import torch
print("torch.cuda.is_available:", torch.cuda.is_available())
```

little51, Jun 30 '23
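(Editor's note: to check those pins without opening a REPL, here is a small hypothetical helper; the names are the pip distribution names from little51's reply, and recent Pythons normalize `-`/`_` in the lookup:)

```python
# Hypothetical version check for the packages pinned above.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("flash_attn", "bitsandbytes", "fschat"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is not installed")
```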

I was able to resume training from the checkpoint. However, the loss explodes drastically after resuming. All I did was run the exact same fine-tuning script from the README.md without changing anything. Do I have to modify anything to resume training from a checkpoint?

You can see in the attached image that the loss explodes from epoch 0.26 to 0.29.

[Screenshot: training loss curve, exploding between epoch 0.26 and 0.29]

mvuthegoat, Jul 01 '23
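(Editor's note, a suggestion not confirmed in the thread: one way to tell whether the explosion comes from state that was never restored is to compare the live LoRA weights against the adapter file in the checkpoint after resuming. A sketch assuming a peft-style adapter checkpoint; the path and key naming are hypothetical and may need adjusting:)

```python
# Hypothetical diagnostic: verify the resumed model actually carries the
# checkpoint's weights. Saved keys may carry peft prefixes, so only
# parameters whose names match exactly are compared.
import torch

def check_resume_restored(model, ckpt_path="output/checkpoint-500/adapter_model.bin"):
    saved = torch.load(ckpt_path, map_location="cpu")
    live = {name: p.detach().cpu() for name, p in model.named_parameters()}
    for name, tensor in saved.items():
        if name in live and not torch.allclose(tensor, live[name], atol=1e-6):
            print("differs after resume:", name)
```

If every matching parameter differs, the resume silently reinitialized the adapter, which would be consistent with a loss jump like the one in the screenshot.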

This is probably related to bitsandbytes. Please send the full logs to [email protected] if possible.

little51, Jul 02 '23
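(Editor's note: when sending logs, it may help to attach the environment details alongside them; a small sketch using only standard PyTorch and stdlib attributes:)

```python
# Gather environment info to attach to the logs.
import platform
import torch

print("python:", platform.python_version())
print("torch:", torch.__version__)
print("cuda runtime:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("gpu:", torch.cuda.get_device_name(0))
```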

Were you able to resume training from checkpoints without this problem?

mvuthegoat, Jul 04 '23