Question about average_log_prob

Open LSX-Sneakerprogrammer opened this issue 2 years ago • 9 comments

Hi, I see there is a bool variable in _get_batch_logps of trainers.py that controls whether to take the average log probability or not, and I have two questions.

  1. Did you do experiments on this to see which one performs better?
  2. If I choose to use the average log probability, I assume the pad_to_length function needs to be turned off. Is that right?

Hope you can help me with these questions, thanks a lot!

LSX-Sneakerprogrammer avatar Oct 24 '23 05:10 LSX-Sneakerprogrammer

  2. If I choose to use the average log probability, I assume the pad_to_length function needs to be turned off. Is that right?

No. Padded tokens are not counted; see here: https://github.com/eric-mitchell/direct-preference-optimization/blob/main/trainers.py#L96C5-L104
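
For reference, a paraphrased sketch of what the linked masking logic does (see the repo for the exact implementation): label positions set to -100 (padding / ignored tokens) are masked out before summing, so padding never contributes to the per-sequence log probability, whether or not you average.

```python
import torch

def get_batch_logps(logits, labels, average_log_prob=False):
    """Paraphrased sketch of _get_batch_logps, not the verbatim repo code.

    logits: (batch, seq_len, vocab) model outputs
    labels: (batch, seq_len) token ids, with -100 at padded/ignored positions
    """
    # Shift so that the token at position t is predicted from logits at t-1.
    labels = labels[:, 1:].clone()
    logits = logits[:, :-1, :]

    # Mask out positions labeled -100 (padding / ignored tokens).
    loss_mask = labels != -100
    labels[labels == -100] = 0  # any valid index; these positions are masked anyway

    # Per-token log probability of each target token.
    per_token_logps = torch.gather(
        logits.log_softmax(-1), dim=2, index=labels.unsqueeze(2)
    ).squeeze(2)

    if average_log_prob:
        # Mean over non-padded tokens only.
        return (per_token_logps * loss_mask).sum(-1) / loss_mask.sum(-1)
    # Sum over non-padded tokens only.
    return (per_token_logps * loss_mask).sum(-1)
```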

alex-ht avatar Oct 27 '23 02:10 alex-ht

  1. Did you do experiments on this to see which one performs better?

~~I have already tried specifying average_log_prob=True, but the beta value needs adjustment. For example, if the sum of log probabilities is divided by a rough average length of 100 tokens, then beta needs to be increased by about 100 times.~~

average_log_prob=False is better
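
For context on the (struck-through) point about rescaling beta: in the standard DPO loss, beta multiplies the gap between policy and reference log-ratios. If the per-sequence log probs are length-averaged, that gap shrinks by roughly the response length, so beta has to grow by about the same factor to keep the implicit reward on a similar scale. A minimal sketch of the loss (standard DPO formulation, not copied verbatim from this repo):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps, beta=0.1):
    """Standard DPO loss; the *_logps inputs are the per-sequence values
    returned by _get_batch_logps (summed, or length-averaged)."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = reference_chosen_logps - reference_rejected_logps
    # beta scales the log-ratio gap: with length-averaged logps the gap is
    # roughly 1/L of the summed version, so beta needs to be ~L times larger.
    losses = -F.logsigmoid(beta * (pi_logratios - ref_logratios))
    return losses.mean()
```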

alex-ht avatar Oct 27 '23 02:10 alex-ht

  1. Did you do experiments on this to see which one performs better?

~~I have already tried specifying average_log_prob=True, but the beta value needs adjustment. For example, if the sum of log probabilities is divided by a rough average length of 100 tokens, then beta needs to be increased by about 100 times.~~

average_log_prob=False is better

Thanks for your reply! I tried average_log_prob=False, and the model seems more likely to generate long responses compared to the original. To avoid this problem I tried average_log_prob=True, but after training the model tends to generate repeated words. Have you met this problem, and do you know how to solve it? Thanks a lot!

LSX-Sneakerprogrammer avatar Dec 17 '23 14:12 LSX-Sneakerprogrammer

I ran into the same issue a few months ago and didn't have any success with average_log_prob=True -- the model became very degenerate. Ultimately I left average_log_prob=False and had to add some extra tricks to keep DPO from teaching the model to write very long responses.

dblakely avatar Jan 18 '24 19:01 dblakely

Hi all, could somebody please explain why average_log_prob=False makes the model generate longer responses? Any hints/clarifications are appreciated.

longbowzhang avatar Jan 22 '24 10:01 longbowzhang

Hi all, could somebody please explain why average_log_prob=False makes the model generate longer responses? Any hints/clarifications are appreciated.

I've noticed that the model tends to generate longer responses as training progresses. I suspect that setting average_log_prob=True might slow down this process compared to when it's set to False. @longbowzhang, have you not encountered this issue when setting it to True?

yata0 avatar Mar 06 '24 07:03 yata0

I ran into the same issue a few months ago and didn't have any success with average_log_prob=True -- the model became very degenerate. Ultimately I left average_log_prob=False and had to add some extra tricks to keep DPO from teaching the model to write very long responses.

@dblakely Could you please share the "extra tricks"?

yata0 avatar Mar 06 '24 07:03 yata0

Hey @yata0, the author mentioned some ideas here and I tried each of those 4 suggestions. All of them helped to some extent. To "normalize" the data lengths, I simply dropped a bunch of the longest positive examples from my dataset to bring the length distribution of positives and negatives closer together (a big part of the problem in my case was simply that the positive examples in my dataset were on average a fair amount longer than the negatives and DPO was over-optimizing that trait).
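
A rough sketch of that kind of length filtering, assuming preference pairs stored as dicts with 'chosen' and 'rejected' text and a Hugging Face-style tokenizer (the helper name and the percentile cutoff are made up and would need tuning per dataset):

```python
import numpy as np

def drop_longest_positives(pairs, tokenizer, percentile=90):
    """Drop the longest chosen ('positive') examples so the length
    distributions of chosen and rejected responses are closer together."""
    chosen_lens = [len(tokenizer.encode(ex["chosen"])) for ex in pairs]
    cutoff = np.percentile(chosen_lens, percentile)
    return [ex for ex, n in zip(pairs, chosen_lens) if n <= cutoff]
```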

dblakely avatar Mar 07 '24 13:03 dblakely

Hey @yata0, the author mentioned some ideas here and I tried each of those 4 suggestions. All of them helped to some extent. To "normalize" the data lengths, I simply dropped a bunch of the longest positive examples from my dataset to bring the length distribution of positives and negatives closer together (a big part of the problem in my case was simply that the positive examples in my dataset were on average a fair amount longer than the negatives and DPO was over-optimizing that trait).

Thanks!

yata0 avatar Mar 08 '24 03:03 yata0