direct-preference-optimization
Why do chosen_rewards sometimes become negative?
During DPO training on some datasets, the chosen rewards recorded in the logger (wandb, tensorboard, etc.) are always negative. Is this normal? Why does this happen?
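For context, DPO defines the chosen reward as beta * (log pi_theta(y_chosen|x) - log pi_ref(y_chosen|x)), so it goes negative whenever the policy assigns lower log-probability to the chosen response than the reference model does. A minimal sketch with made-up log-prob values (the numbers here are purely illustrative, not from any real run):

```python
beta = 0.1  # DPO temperature coefficient

# Hypothetical sequence log-probs for two examples (assumed values).
policy_chosen_logps = [-105.0, -98.0]
ref_chosen_logps = [-100.0, -95.0]
policy_rejected_logps = [-130.0, -120.0]
ref_rejected_logps = [-115.0, -110.0]

# Reward = beta * (policy logp - reference logp).
chosen_rewards = [beta * (p - r) for p, r in zip(policy_chosen_logps, ref_chosen_logps)]
rejected_rewards = [beta * (p - r) for p, r in zip(policy_rejected_logps, ref_rejected_logps)]

# Chosen rewards are negative: the policy drifted below the reference
# on the chosen responses. The DPO loss only pushes the *margin*
# (chosen - rejected) to grow, so this can still be healthy training.
margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
print(chosen_rewards)  # both negative
print(margins)         # both positive
```

So negative chosen rewards alone are not necessarily a problem; the chosen-minus-rejected margin (and reward accuracy) is usually the more informative signal.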
Has this been solved?