llama-lora-fine-tuning
Expected a cuda device, but got: cpu
When I try to resume training from a checkpoint, it fails with `ValueError: Expected a cuda device, but got: cpu`. How do I fix this?
Please check the versions: flash_attn==1.0.5, bitsandbytes==0.37.2, fschat==0.2.10, and check that PyTorch can see the GPU:

```python
import torch
print("torch.cuda.is_available:", torch.cuda.is_available())
```
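If CUDA is visible but the error still appears while loading the checkpoint, one possibility (a sketch, not confirmed in this thread) is that tensors saved on the GPU are being deserialized into a CPU context. Remapping them explicitly when loading can help; `checkpoint.pt` is a placeholder name, not a file from this repo:

```python
# Minimal sketch, assuming the failure happens while deserializing a
# checkpoint that was saved on GPU. "checkpoint.pt" is a placeholder.
import torch

assert torch.cuda.is_available(), "CUDA not visible; check drivers and CUDA_VISIBLE_DEVICES"

# map_location remaps any saved cuda tensors onto an available device,
# avoiding device-mismatch errors during loading.
state = torch.load("checkpoint.pt", map_location="cuda:0")
```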
I was able to resume training from the checkpoint. However, the loss explodes drastically after resuming. All I did was run the exact same fine-tuning script from README.md without changing anything. Do I have to modify anything to resume training from a checkpoint?
You can see in the attached image that the loss explodes between epoch 0.26 and 0.29.
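One cause worth ruling out (a hedged guess, not confirmed in this thread): with some older peft/transformers combinations, the Trainer checkpoint did not restore the LoRA adapter weights on resume, so training effectively restarted from a fresh adapter and the loss jumped. A minimal sketch of manually reloading the adapter state before resuming, where `model` is the PEFT-wrapped model and `checkpoint_dir` is a placeholder path:

```python
# Hypothetical sketch: manually restoring LoRA adapter weights before
# resuming. `model` and `checkpoint_dir` are placeholders, not names
# from this repo.
import os
import torch
from peft import set_peft_model_state_dict

checkpoint_dir = "output/checkpoint-1000"  # placeholder path
adapter_path = os.path.join(checkpoint_dir, "adapter_model.bin")
fallback_path = os.path.join(checkpoint_dir, "pytorch_model.bin")

# Some older peft versions saved adapter weights under pytorch_model.bin
# instead of adapter_model.bin, so try both locations.
weights_path = adapter_path if os.path.exists(adapter_path) else fallback_path
adapter_weights = torch.load(weights_path, map_location="cuda:0")
set_peft_model_state_dict(model, adapter_weights)
```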
This is probably related to bitsandbytes. Please send all logs to [email protected] if possible.
Were you able to resume training from checkpoints without this problem?