
KTOTrainer: BCO improvements

Open claralp opened this issue 1 year ago • 3 comments

I recently experimented quite a bit with the BCO loss type in the KTO Trainer.
This PR includes some changes that helped me run BCO successfully and effectively on multiple GPUs (on Azure in this case):

  • When using the BCO loss type there is no need for the KL dataset anymore. By skipping the creation of the KL dataset we save time and memory (see the loss sketch after this list).
  • Do not assert that every per-device mini-batch contains both desired and undesired examples. With this assertion it was impossible to train large models that only allow a per-device batch size of 1. KTO and BCO are also supposed to work with unpaired preference data and with different amounts of desired and undesired examples. In my experiments BCO proved to work well without this assumption, and I cannot find it mentioned anywhere in the paper either.
  • When checkpointing, also save the RunningMoments object, which BCO uses to calculate the running mean reward $\delta$. If it is not saved, restarting from a checkpoint corrupts the whole training run (see the checkpointing sketch after this list).
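For reference, here is a minimal sketch of the BCO objective that the first two points rely on: every completion is scored independently against the running mean reward $\delta$, so neither a KL dataset nor a desired/undesired pair in every per-device mini-batch is required. This is illustrative only and not the exact TRL code; the function name, signature, and the `beta` default are assumptions.

```python
import torch
import torch.nn.functional as F

def bco_loss(policy_chosen_logps, ref_chosen_logps,
             policy_rejected_logps, ref_rejected_logps,
             delta, beta=0.1):
    """Classify each completion against the running reward baseline delta.
    Either group may be empty on a given device (unpaired data)."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    losses = []
    if chosen_rewards.numel() > 0:    # desired examples: push reward above delta
        losses.append(-F.logsigmoid(chosen_rewards - delta))
    if rejected_rewards.numel() > 0:  # undesired examples: push reward below delta
        losses.append(-F.logsigmoid(-(rejected_rewards - delta)))
    if not losses:                    # degenerate batch with no examples at all
        return torch.tensor(0.0), chosen_rewards, rejected_rewards
    return torch.cat(losses).mean(), chosen_rewards, rejected_rewards
```

Here `delta` would come from the RunningMoments mean, updated with the rewards of each step across all processes.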
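And a minimal sketch of the checkpointing point, assuming RunningMoments exposes `mean`, `std`, `var`, and `count`; the helper names and the JSON file name are hypothetical, not TRL API:

```python
import json
import os

def save_running_moments(running, checkpoint_dir):
    # Persist the running reward statistics next to the model checkpoint.
    state = {"mean": running.mean, "std": running.std,
             "var": running.var, "count": running.count}
    with open(os.path.join(checkpoint_dir, "running_moments.json"), "w") as f:
        json.dump(state, f)

def load_running_moments(running, checkpoint_dir):
    # Restore the statistics on resume so delta does not reset to its initial value.
    path = os.path.join(checkpoint_dir, "running_moments.json")
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)
        running.mean, running.std = state["mean"], state["std"]
        running.var, running.count = state["var"], state["count"]
```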

Please have a look at these changes and let me know whether you can approve the solutions implemented here, or whether you would prefer other solutions to the issues described above.

@kashif @lewtun

claralp avatar Jul 04 '24 17:07 claralp

thanks @claralp, just looking at this again: do you think it makes sense to split the BCO and KTO trainers for a cleaner implementation?

kashif avatar Jul 04 '24 18:07 kashif

@kashif that would also make sense. But then some shared functions (e.g. _tokenize, _process_tokens) would need to move to a shared place, maybe trainer/utils.py.
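A rough sketch of the layout such a split could use, assuming the shared helpers move to trainer/utils.py; bodies are elided and signatures simplified, so this is a proposal rather than existing code:

```python
# --- trl/trainer/utils.py ---
def _tokenize(batch, tokenizer):
    """Shared prompt/completion tokenization used by both trainers."""
    ...

def _process_tokens(example, **kwargs):
    """Shared truncation and label masking used by both trainers."""
    ...

# --- trl/trainer/kto_trainer.py ---
# from .utils import _tokenize, _process_tokens   # KTO-specific loss stays here

# --- trl/trainer/bco_trainer.py ---
# from .utils import _tokenize, _process_tokens   # BCO loss + RunningMoments delta live here
```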

claralp avatar Jul 04 '24 18:07 claralp

yeah, at least for me these if ... else branches make things confusing...

kashif avatar Jul 04 '24 18:07 kashif