[bug] Completed model does not load from checkpoint / generate produces same as base model
Prerequisite
- Ensure checkpoints save correctly: Apply https://github.com/artidoro/qlora/pull/44 to fix https://github.com/artidoro/qlora/issues/38 and https://github.com/artidoro/qlora/issues/41
Problem
Model has finished training and the output looks like this:
- ✅ `checkpoint-#/adapter_model/` directory exists
- ✅ `adapter_model.bin` is MB, not 443 bytes
- ✅ `completed` file exists
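Those on-disk checks can be scripted. Below is a minimal helper sketch; it assumes `adapter_model/` sits inside the checkpoint directory and the `completed` marker sits in the parent output directory, and the names `verify_checkpoint` and `min_bytes` are mine, not from the repo:

```python
import os

def verify_checkpoint(checkpoint_dir, min_bytes=1_000_000):
    """Sanity-check a saved checkpoint: adapter directory present,
    adapter_model.bin bigger than a near-empty 443-byte placeholder,
    and a `completed` marker in the parent output directory."""
    adapter_dir = os.path.join(checkpoint_dir, "adapter_model")
    bin_path = os.path.join(adapter_dir, "adapter_model.bin")
    completed = os.path.join(os.path.dirname(checkpoint_dir), "completed")
    return (os.path.isdir(adapter_dir)
            and os.path.isfile(bin_path)
            and os.path.getsize(bin_path) >= min_bytes
            and os.path.isfile(completed))
```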
We want to load from the checkpoint here: https://github.com/artidoro/qlora/blob/f96eec16756f3594fb21971e817989ef14638c10/qlora.py#L309-L311, which needs `checkpoint_dir` to be set.

`checkpoint_dir` comes from `get_last_checkpoint`: https://github.com/artidoro/qlora/blob/f96eec16756f3594fb21971e817989ef14638c10/qlora.py#L587-L591
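Paraphrasing the linked lines, the helper behaves roughly like this (a sketch for illustration, not the actual source; it returns a `(checkpoint_dir, is_completed)` pair):

```python
import os

def get_last_checkpoint_sketch(output_dir):
    """Rough paraphrase of qlora.py's get_last_checkpoint (not verbatim)."""
    if not os.path.isdir(output_dir):
        return None, False  # nothing has been trained yet
    if os.path.exists(os.path.join(output_dir, "completed")):
        # The problematic early return: training finished, so the
        # checkpoint path is discarded even though it exists on disk.
        return None, True
    # Otherwise, pick the highest-numbered checkpoint-* directory.
    max_step = 0
    for name in os.listdir(output_dir):
        if name.startswith("checkpoint-") and os.path.isdir(os.path.join(output_dir, name)):
            max_step = max(max_step, int(name.split("-")[1]))
    if max_step == 0:
        return None, False
    return os.path.join(output_dir, f"checkpoint-{max_step}"), False
```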
✅ Currently I'm seeing `Detected that training was already completed!`, which is correct.
Unfortunately, the case where `is_completed = True` also means:

❌ `checkpoint_dir = None`: https://github.com/artidoro/qlora/blob/f96eec16756f3594fb21971e817989ef14638c10/qlora.py#L562-L564
Therefore, `checkpoint_dir` is never actually used, and:

❌ the model is reset to the default base model: https://github.com/artidoro/qlora/blob/f96eec16756f3594fb21971e817989ef14638c10/qlora.py#L317

which means the generated output reflects the base model, not what was trained.
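The downstream effect can be sketched like this (a paraphrase of the loading branch, not the actual source; `PeftModel.from_pretrained` and `get_peft_model` are the peft calls used around the linked line, and the adapter path shown is an assumption):

```python
# Paraphrased sketch of the branch around qlora.py#L317 (not verbatim).
if checkpoint_dir is not None:
    # Trained LoRA adapter weights are loaded on top of the base model.
    model = PeftModel.from_pretrained(model, join(checkpoint_dir, "adapter_model"))
else:
    # checkpoint_dir is None, so a *fresh* LoRA adapter is attached instead,
    # and generation reflects only the base model.
    model = get_peft_model(model, LoraConfig(...))
```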
Workaround
It all works after I remove this line:
https://github.com/artidoro/qlora/blob/f96eec16756f3594fb21971e817989ef14638c10/qlora.py#L564
```diff
- if is_completed: return None, True # already finished
```
I'm not certain why it is needed, though.
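An alternative to deleting the line outright would be to keep the completion signal but stop discarding the path, i.e. return the latest checkpoint together with `is_completed = True`. A hypothetical patched version (my paraphrase, untested against the repo):

```python
import os

def get_last_checkpoint_fixed(output_dir):
    """Hypothetical fix: still report completion, but return the
    checkpoint path so callers can load the trained adapter."""
    if not os.path.isdir(output_dir):
        return None, False
    is_completed = os.path.exists(os.path.join(output_dir, "completed"))
    max_step = 0
    for name in os.listdir(output_dir):
        if name.startswith("checkpoint-") and os.path.isdir(os.path.join(output_dir, name)):
            max_step = max(max_step, int(name.split("-")[1]))
    if max_step == 0:
        return None, is_completed
    # Key change vs. the original: the path is returned even when completed.
    return os.path.join(output_dir, f"checkpoint-{max_step}"), is_completed
```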
Hope this helps save someone else the hours I just wasted thinking my training/dataset/etc was a problem when it was really not even using the trained model 😆
Are you able to resume training from the checkpoint with this?
Ah maybe not, you're thinking we also need https://github.com/artidoro/qlora/pull/79 ?
I was testing with a shorter `max_steps` value, so training finished earlier and resuming from checkpoints mattered less. Now that fine-tuning and generation are working, I'll be increasing `max_steps` and would likely benefit from your pull request. Thanks!
🥰 Your "workaround" is a very good fix, it is clearly working and should be merged ASAP.
I can confirm it's working. I was trying to run prediction from a trained checkpoint, but the LoRA weights were not loaded. After removing the line mentioned above, the predictions are finally normal.