
Why do chosen_rewards sometimes become negative?

Open • DwarfWarriors opened this issue 2 years ago • 1 comment

During DPO training on some datasets, the chosen rewards recorded in the logger (wandb, tensorboard, etc.) are always negative. Is this normal? Why does this happen?

DwarfWarriors, Sep 08 '23 02:09
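For context, here is a minimal sketch of how this metric is usually defined, assuming the logged chosen reward is the implicit reward from the DPO paper, beta * (log pi(y_w|x) - log pi_ref(y_w|x)). Under that assumption, a negative value only means the policy currently assigns the chosen response a lower log-probability than the reference model does. The loss depends on the margin between the chosen and rejected rewards, so it can still decrease while both rewards are negative. The function name `dpo_rewards_and_loss` and all numbers below are hypothetical, chosen for illustration; they are not taken from this repository or from any training run.

```python
import math


def dpo_rewards_and_loss(policy_chosen_logp, policy_rejected_logp,
                         ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Hypothetical helper: implicit DPO rewards and loss for one preference pair.

    Each *_logp argument is the summed log-probability of a full response.
    """
    # Implicit reward: beta-scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)

    # The loss only sees the difference between the two rewards.
    margin = chosen_reward - rejected_reward

    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    loss = math.log1p(math.exp(-margin))
    return chosen_reward, rejected_reward, loss


# Made-up log-probabilities: the policy has lowered the likelihood of BOTH
# responses relative to the reference, but the rejected one by much more.
chosen_reward, rejected_reward, loss = dpo_rewards_and_loss(
    policy_chosen_logp=-52.0, policy_rejected_logp=-75.0,
    ref_chosen_logp=-50.0, ref_rejected_logp=-60.0, beta=0.1,
)

# The chosen reward is negative, yet the margin is positive, so the loss is
# below log(2), the value it takes when the policy equals the reference.
print(chosen_reward, rejected_reward, loss)
```

In this sketch the optimizer is satisfied as long as the rejected reward falls faster than the chosen one, which is one way a consistently negative chosen reward can coexist with an improving loss. Whether that is what happens on a particular dataset would need to be checked against the logged rejected rewards and reward margins.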

Has this been solved?

yiyepiaoling0715, Jan 03 '25 13:01