
Expected a cuda device, but got: cpu

mvuthegoat opened this issue 2 years ago · 4 comments

When I try to resume training from a checkpoint, it fails with `ValueError: Expected a cuda device, but got: cpu`. How do I fix this?

[Screenshot: traceback ending in `ValueError: Expected a cuda device, but got: cpu`]

mvuthegoat, Jun 29 '23
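(Editor's note, not from the thread: this error commonly means a tensor or RNG state from the checkpoint was materialized on the CPU while the trainer expected it on the GPU. A minimal sketch of a first check, assuming plain PyTorch and a hypothetical checkpoint path:)

```python
# Minimal sketch (not the repo's code): confirm CUDA is visible, then load
# the checkpoint directly onto the GPU instead of the default CPU placement.
import torch

assert torch.cuda.is_available(), "resuming requires a visible CUDA device"

# "output/checkpoint-200" is a hypothetical Trainer checkpoint directory.
state = torch.load(
    "output/checkpoint-200/pytorch_model.bin",
    map_location="cuda:0",  # keep checkpoint tensors off the CPU
)
```

Even if the script resumes through Hugging Face's `Trainer(resume_from_checkpoint=...)` rather than a manual `torch.load`, verifying `torch.cuda.is_available()` is still the first thing to rule out.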

Please check the versions: `flash_attn==1.0.5`, `bitsandbytes==0.37.2`, `fschat==0.2.10`, and check PyTorch:

```python
import torch
print("torch.cuda.is_available:", torch.cuda.is_available())
```

little51, Jun 30 '23
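(Editor's note: to check those pins without opening a REPL, here is a small hypothetical helper; the names are the pip distribution names from little51's reply, and recent Pythons normalize `-`/`_` in the lookup:)

```python
# Hypothetical version check for the packages pinned above.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("flash_attn", "bitsandbytes", "fschat"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is not installed")
```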

I was able to resume training from the checkpoint. However, the loss explodes drastically after resuming. All I did was run the exact same fine-tuning script from the README.md without changing anything. Do I have to modify anything to resume training from a checkpoint?

You can see in the attached image that the loss explodes from epoch 0.26 to 0.29.

[Screenshot: training loss curve, exploding between epoch 0.26 and 0.29]

mvuthegoat, Jul 01 '23
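(Editor's note, a suggestion not confirmed in the thread: one way to tell whether the explosion comes from state that was never restored is to compare the live LoRA weights against the adapter file in the checkpoint after resuming. A sketch assuming a peft-style adapter checkpoint; the path and key naming are hypothetical and may need adjusting:)

```python
# Hypothetical diagnostic: verify the resumed model actually carries the
# checkpoint's weights. Saved keys may carry peft prefixes, so only
# parameters whose names match exactly are compared.
import torch

def check_resume_restored(model, ckpt_path="output/checkpoint-500/adapter_model.bin"):
    saved = torch.load(ckpt_path, map_location="cpu")
    live = {name: p.detach().cpu() for name, p in model.named_parameters()}
    for name, tensor in saved.items():
        if name in live and not torch.allclose(tensor, live[name], atol=1e-6):
            print("differs after resume:", name)
```

If every matching parameter differs, the resume silently reinitialized the adapter, which would be consistent with a loss jump like the one in the screenshot.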

This is probably related to bitsandbytes. Please send the full logs to [email protected] if possible.

little51, Jul 02 '23
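(Editor's note: when sending logs, it may help to attach the environment details alongside them; a small sketch using only standard PyTorch and stdlib attributes:)

```python
# Gather environment info to attach to the logs.
import platform
import torch

print("python:", platform.python_version())
print("torch:", torch.__version__)
print("cuda runtime:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("gpu:", torch.cuda.get_device_name(0))
```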

Were you able to resume training from checkpoints without this problem?

mvuthegoat, Jul 04 '23