torchtune Feature Request : ORPO

Hi,

First of all, thank you for this library! It's very clean, and I appreciate that all I need is PyTorch!

I wanted to open an issue for the integration of ORPO. Not needing to do SFT before the RLHF step is huge, since it saves a lot of compute when training on preference data. Hoping it can be integrated into torchtune (with LoRA support if possible)!

There is an existing integration in TRL.

Thanks!

Apr 28 '24 19:04 nivibilla

@nivibilla thanks for opening this issue!

ORPO would indeed be a really nice addition to the library. It hasn't been at the top of our list, to be honest, but maybe we should reconsider. Is this something you'd be open to adding? DPO and PPO (WIP) were both added by our awesome community members, and if you'd be interested in adding ORPO, I'm happy to help brainstorm and review the design, code, etc.

Apr 29 '24 15:04 kartikayk

Hi @kartikayk

I'm not that experienced in writing custom training loops. Mainly a Hugging Face user haha. I'd be no better than Llama 3 70B attempting it 🤣

Apr 30 '24 21:04 nivibilla

I'm partly serious here, but why not train a Code Llama 70B using torchtune and then see if it gets you the right recipe :)

Apr 30 '24 23:04 kartikayk

@kartikayk finally got round to doing this and saw a new paper on SimPO (Simple Preference Optimization), which was indeed simpler to implement than ORPO. The only real change is the loss function. And the paper claims some impressive results, reporting that it beats the other offline RL methods it compares against.
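To illustrate why the change is mostly confined to the loss function, here is a minimal sketch of the SimPO objective for a single preference pair, as I understand it from the paper: the implicit reward is the length-normalized log-probability of a response, and a target margin is subtracted before the log-sigmoid. This is plain Python for readability and is not the code from the draft PR; the function name `simpo_loss`, its argument names, and the default `beta` and `gamma` values are assumptions chosen for illustration. A real recipe would apply the same arithmetic to batched tensors.

```python
import math


def simpo_loss(chosen_logp_sum, rejected_logp_sum, chosen_len, rejected_len,
               beta=2.0, gamma=1.0):
    """SimPO loss for one (chosen, rejected) pair.

    chosen_logp_sum / rejected_logp_sum: summed token log-probs of each
    response under the policy. chosen_len / rejected_len: token counts.
    beta (reward scale) and gamma (target margin) are illustrative defaults.
    """
    # Length-normalized log-probabilities act as the implicit rewards,
    # so no reference model is needed.
    chosen_avg = chosen_logp_sum / chosen_len
    rejected_avg = rejected_logp_sum / rejected_len

    # Scaled reward difference minus the target margin.
    margin = beta * (chosen_avg - rejected_avg) - gamma

    # -log(sigmoid(margin)) == softplus(-margin), computed stably.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical numbers: the chosen response has the higher average log-prob.
loss = simpo_loss(chosen_logp_sum=-12.0, rejected_logp_sum=-30.0,
                  chosen_len=10, rejected_len=12)
print(f"SimPO loss: {loss:.4f}")
```

Compared with DPO, the differences in this sketch are the length normalization, the absence of reference-model log-probs, and the extra `gamma` margin.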

I have a draft PR here: #1036. It's a draft since it's quite messy and I feel it has a lot of duplicate code. Also, I haven't tested it at all.

Closing this issue in favour of #1037

May 30 '24 21:05 nivibilla
