RAFT/DPO/RLHF?
@dvgodoy Sorry, I couldn't find the discussions section. Could you also provide some guides on how you would do RAFT and RLHF? There is plenty of material on them around the web, but having it come from you, presented in the same way as the rest of your content, would be awesome. Thanks
Hi @vahidreza
Thanks for supporting my work! I don't have any materials of my own but, if you want to explore preference-tuning following what you learned in my book, I'd recommend taking a closer look at the other Hugging Face trainers in the trl library.
The library is pretty standard, so you'd mostly be switching from SFTTrainer and SFTConfig to DPOTrainer and DPOConfig (or any other trainer you may like). The general workflow (quantization, LoRA adapters) works the same. Of course, you'd also need an appropriate dataset for the task.
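Just to make the switch more concrete, here's a rough sketch of what the DPO version of the workflow could look like. The model name, dataset, and hyperparameters below are only placeholders, and some argument names differ across trl versions (e.g. processing_class was called tokenizer in older releases), so treat it as a starting point rather than a recipe:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer

model_name = "microsoft/phi-3-mini-4k-instruct"  # placeholder base model

# Same quantized loading as in the SFT workflow (QLoRA-style 4-bit)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# LoRA adapters, configured just like for SFT
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules="all-linear",  # shorthand for every linear layer
)

# A preference dataset with "prompt"/"chosen"/"rejected" columns
# (example dataset; swap in whatever fits your task)
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# DPOConfig replaces SFTConfig; beta is the DPO temperature
training_args = DPOConfig(
    output_dir="phi3-dpo",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,
    beta=0.1,
    num_train_epochs=1,
)

# DPOTrainer replaces SFTTrainer; when a PEFT config is passed, no
# separate reference model is needed (the adapters are disabled to
# recover the reference policy)
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # `tokenizer=` in older trl versions
    peft_config=peft_config,
)
trainer.train()
```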
I'd recommend checking the DPO trainer documentation (https://huggingface.co/docs/trl/main/en/dpo_trainer) first, and then you can check Philip Schmid's post (https://www.philschmid.de/dpo-align-llms-in-2024-with-trl) for more details - it includes an example that resembles the workflow in the book, so it should look familiar and hopefully be easier to follow. I hope this helps get you started!
Best, Daniel
Thanks a lot @dvgodoy. What about RAFT, the newly proposed fine-tuning method based on this paper?
Hi @vahidreza
I couldn't really find anything other than the official repo of this particular RAFT technique, I'm sorry. It seems like it still isn't mainstream enough to be implemented in Hugging Face's TRL.
Best, Daniel
@dvgodoy Thanks anyway. I'll try to implement it myself and see how it goes.
Thanks a lot again