alignment-handbook DPO loss


I am training DPO with LoRA, and the loss behaves strangely: it drops sharply at the beginning of each epoch. Have you seen this issue before? (Screenshot 2023-11-17 at 12 28 28)

Nov 17 '23 17:11 JiuhaiChen

It seems that full fine-tuning has this problem, while LoRA doesn't. Could you share the YAML training configuration? Also, how many GPUs are you using?

Nov 17 '23 18:11 ChenDRAG

Thanks for your reply. I haven't tried full-model fine-tuning. For LoRA, I only changed: gradient_accumulation_steps: 1, per_device_train_batch_size: 16, per_device_eval_batch_size: 4, save_strategy: "epoch". I am using 8 A6000 GPUs. Also, I am not sure whether you observed that the eval loss increases during training.

Nov 17 '23 18:11 JiuhaiChen

Sorry, I did not encounter this problem. Do you use the official binary dataset? What is your base model? Though I don't think they matter that much.

Nov 17 '23 18:11 ChenDRAG

Yeah, I agree the eval loss does not matter. For LoRA, how many cards are you using?

Nov 17 '23 18:11 JiuhaiChen

8 A40 cards. My new experiments also encounter this problem.

Difference between the two configurations:

Previous: batch size 4, accumulation 2, cards 8, lr 1e-7

New: batch size 8, accumulation 1, cards 8, lr 1e-4

I think the main change is that I increased the lr a lot. Are you sure you used lr=1e-7 in your experiments?
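As a quick sanity check, here is a minimal sketch of the arithmetic behind that comparison. It assumes the batch sizes listed above are per-device values (the post does not say so explicitly), and the helper name `effective_batch_size` is made up for illustration.

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Number of samples that contribute to a single optimizer step."""
    return per_device_batch * grad_accum_steps * num_gpus


# Values taken from the two configurations in this post
# (assumption: "batch size" means per-device batch size).
previous = effective_batch_size(per_device_batch=4, grad_accum_steps=2, num_gpus=8)
new = effective_batch_size(per_device_batch=8, grad_accum_steps=1, num_gpus=8)

# How much larger the new learning rate is than the previous one.
lr_ratio = 1e-4 / 1e-7

print(previous, new)  # 64 64
print(f"lr increased by a factor of about {lr_ratio:.0f}")
```

Under that assumption both runs use the same effective batch size, so the learning rate is the only setting listed here that differs between them.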

Nov 18 '23 02:11 ChenDRAG

I’m currently training a LoRA across all Mistral modules with the standard settings, except with no eval and a batch size of 1 on a 3090. My loss is hitting 0.29 after only 180 steps (0.4 epochs).

Edit: at epoch 0.52, 210 steps in, the loss is at 0.18 and rewards/accuracy is 1.0.

Nov 20 '23 11:11 NicolasMejiaPetit

Quite weird. I just trained DPO and my loss is normal across epochs, pretty similar to the results shared on the HF model card. How about rebasing and trying again? A loss of 0.29 or lower definitely means the model is somehow seeing the right prediction token.
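To put those loss values in context, here is a rough sketch of the standard per-pair DPO (sigmoid) loss in plain Python. It is not the alignment-handbook or TRL code; the function names and `beta = 0.1` are illustrative assumptions. It shows that the loss is ln 2 when the policy still equals the reference model, and it converts a loss value back into the scaled reward margin it implies.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return x + math.log1p(math.exp(-x)) if x > 0 else math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (policy log-ratio - reference log-ratio)))."""
    margin = beta * ((policy_chosen_logp - policy_rejected_logp)
                     - (ref_chosen_logp - ref_rejected_logp))
    return softplus(-margin)  # identical to -log(sigmoid(margin))


def margin_for_loss(loss: float) -> float:
    """Invert the loss: the scaled reward margin that yields a given per-pair loss."""
    return -math.log(math.expm1(loss))


# Policy identical to the reference model: margin is 0, so the loss is ln 2.
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # 0.6931

# Scaled reward margins implied by the losses reported in this thread.
for reported in (0.29, 0.18):
    print(reported, round(margin_for_loss(reported), 2))
```

By this reading, a loss of 0.29 corresponds to an average scaled margin of roughly 1.1 between chosen and rejected responses, which is a large gap to reach within a fraction of an epoch.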

Nov 21 '23 12:11 fblgit

In most cases, DPO is trained for only one epoch; more epochs tend to cause a performance crash. Smaller learning rates also generally lead to better results, and I recommend starting from 5e-7.

Jan 24 '25 07:01 qychen2001
