
[bug] Completed model does not load from checkpoint / generate produces same as base model

Open Glavin001 opened this issue 1 year ago • 3 comments

Prerequisite

  • Ensure checkpoints save correctly: Apply https://github.com/artidoro/qlora/pull/44 to fix https://github.com/artidoro/qlora/issues/38 and https://github.com/artidoro/qlora/issues/41

Problem

Model has finished training and the output looks like this:

*(screenshot of the training log showing the completed run)*

  • checkpoint-#/adapter_model/ directory exists
    • adapter_model.bin is MB-sized, not 443 bytes.
  • completed file exists

We want to load from the checkpoint here: https://github.com/artidoro/qlora/blob/f96eec16756f3594fb21971e817989ef14638c10/qlora.py#L309-L311

which needs checkpoint_dir to be set.

checkpoint_dir comes from get_last_checkpoint: https://github.com/artidoro/qlora/blob/f96eec16756f3594fb21971e817989ef14638c10/qlora.py#L587-L591

✅ Currently I'm seeing:

Detected that training was already completed!

which is correct.

Unfortunately, when is_completed = True it also means ❌ checkpoint_dir = None:

https://github.com/artidoro/qlora/blob/f96eec16756f3594fb21971e817989ef14638c10/qlora.py#L562-L564

Therefore, checkpoint_dir is never actually used and ❌ the model is reset to the default base model: https://github.com/artidoro/qlora/blob/f96eec16756f3594fb21971e817989ef14638c10/qlora.py#L317

which means the generated output reflects the base model, not what was trained.
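To make the failure mode concrete, here is a minimal, self-contained sketch of the control flow described above. It is an approximation, not the actual qlora.py source: the function name get_last_checkpoint, the checkpoint-# directory naming, and the completed marker file come from the issue, but the body is reconstructed.

```python
import os
import tempfile

def get_last_checkpoint(output_dir):
    """Approximation of qlora.py's checkpoint lookup (not the real source)."""
    if not os.path.isdir(output_dir):
        return None, False  # nothing trained yet
    is_completed = os.path.exists(os.path.join(output_dir, "completed"))
    if is_completed:
        return None, True  # <-- the problematic early return described above
    # Otherwise, find the checkpoint-# directory with the highest step count.
    max_step = 0
    for name in os.listdir(output_dir):
        if name.startswith("checkpoint-"):
            max_step = max(max_step, int(name.split("-")[1]))
    if max_step == 0:
        return None, is_completed
    return os.path.join(output_dir, f"checkpoint-{max_step}"), is_completed

# Simulate a finished run: checkpoints exist AND the completed marker is set.
with tempfile.TemporaryDirectory() as out:
    os.mkdir(os.path.join(out, "checkpoint-250"))
    os.mkdir(os.path.join(out, "checkpoint-500"))
    open(os.path.join(out, "completed"), "w").close()
    checkpoint_dir, finished = get_last_checkpoint(out)
    # checkpoint_dir is None even though checkpoint-500/ exists on disk,
    # so the caller falls through to loading the plain base model.
    print(checkpoint_dir, finished)
```

Because the early return fires before the directory scan, the newest checkpoint is invisible to the caller exactly when training has succeeded.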

Workaround

It all works after I remove this line:

https://github.com/artidoro/qlora/blob/f96eec16756f3594fb21971e817989ef14638c10/qlora.py#L564

-        if is_completed: return None, True # already finished

I'm not certain why it is needed though.
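Continuing the sketch from the Problem section above, this is what the same reconstructed function looks like with the early return removed, per the workaround. Again, this mirrors qlora.py's names but is an approximation, not the real source.

```python
import os
import tempfile

def get_last_checkpoint(output_dir):
    """Sketch of the patched lookup: keep scanning even after completion."""
    if not os.path.isdir(output_dir):
        return None, False
    is_completed = os.path.exists(os.path.join(output_dir, "completed"))
    # Early "if is_completed: return None, True" removed, so the newest
    # checkpoint directory is still returned after training finishes.
    max_step = 0
    for name in os.listdir(output_dir):
        if name.startswith("checkpoint-"):
            max_step = max(max_step, int(name.split("-")[1]))
    if max_step == 0:
        return None, is_completed
    return os.path.join(output_dir, f"checkpoint-{max_step}"), is_completed

# Same finished-run layout as before: two checkpoints plus the completed marker.
with tempfile.TemporaryDirectory() as out:
    os.mkdir(os.path.join(out, "checkpoint-250"))
    os.mkdir(os.path.join(out, "checkpoint-500"))
    open(os.path.join(out, "completed"), "w").close()
    checkpoint_dir, finished = get_last_checkpoint(out)
    # Now checkpoint_dir points at checkpoint-500, so the adapter weights
    # are loaded instead of falling back to the base model.
    print(checkpoint_dir, finished)
```

The caller still sees finished = True, so the "Detected that training was already completed!" message is preserved; the only behavioral change is that checkpoint_dir is populated.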

Hope this helps save someone else the hours I just wasted thinking my training/dataset/etc. was the problem, when really the trained model wasn't even being used 😆

Glavin001 avatar May 29 '23 07:05 Glavin001

Are you able to resume training from the checkpoint with this?

KKcorps avatar May 29 '23 08:05 KKcorps

Ah, maybe not; you're thinking we also need https://github.com/artidoro/qlora/pull/79?

I was testing with a shorter max_steps value, so training finished earlier and resuming from checkpoints mattered less.

Now that fine-tune and generate are working, I'll be increasing max_steps and likely would benefit from your Pull Request. Thanks!

Glavin001 avatar May 29 '23 16:05 Glavin001

🥰 Your "workaround" is a very good fix; it clearly works and should be merged ASAP.

I can confirm it's working. I was trying to run prediction using a trained checkpoint, but the LoRA weights were not loaded. After removing the line mentioned above, the predictions are finally normal.

Maxwell-Lyu avatar May 30 '23 15:05 Maxwell-Lyu