direct-preference-optimization
Why do chosen_rewards sometimes become negative?
During DPO training on some datasets, the chosen rewards recorded in the logger (wandb, tensorboard, etc.) are always negative. Is this normal? Why does this happen?
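For context, DPO defines the chosen reward as beta * (log pi_theta(y_chosen|x) - log pi_ref(y_chosen|x)), so it goes negative whenever the policy assigns lower log-probability to the chosen response than the reference model does. A minimal sketch with made-up log-prob values (the numbers here are purely illustrative, not from any real run):

```python
beta = 0.1  # DPO temperature coefficient

# Hypothetical sequence log-probs for two examples (assumed values).
policy_chosen_logps = [-105.0, -98.0]
ref_chosen_logps = [-100.0, -95.0]
policy_rejected_logps = [-130.0, -120.0]
ref_rejected_logps = [-115.0, -110.0]

# Reward = beta * (policy logp - reference logp).
chosen_rewards = [beta * (p - r) for p, r in zip(policy_chosen_logps, ref_chosen_logps)]
rejected_rewards = [beta * (p - r) for p, r in zip(policy_rejected_logps, ref_rejected_logps)]

# Chosen rewards are negative: the policy drifted below the reference
# on the chosen responses. The DPO loss only pushes the *margin*
# (chosen - rejected) to grow, so this can still be healthy training.
margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
print(chosen_rewards)  # both negative
print(margins)         # both positive
```

So negative chosen rewards alone are not necessarily a problem; the chosen-minus-rejected margin (and reward accuracy) is usually the more informative signal.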
Has this been solved?