Full-finetune DPO single device recipe
This should be straightforward. The main issue I see coming up is with compile, similar to how we attempt to compile the reference and policy models in our single device PPO recipe. Since the `SelfAttentionLayer` block is inlined and shared across the models, we're going to hit recompiles due to guards on `param.requires_grad` (see the sketch below). This might be acceptable in this case, since the recompiles won't be as severe as with PPO in its current state #2066.
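A minimal sketch of the concern, assuming both models expose a `.layers` `ModuleList` (as torchtune's `TransformerDecoder` does); `setup_compiled_dpo_models` is a hypothetical helper, not recipe code:

```python
import torch
import torch.nn as nn

def setup_compiled_dpo_models(policy: nn.Module, reference: nn.Module) -> None:
    # The reference model is frozen in DPO, so its params differ from the
    # policy's in `requires_grad`. torch.compile guards on that attribute,
    # so a layer class shared across both models is re-traced once per
    # distinct guard set -- a bounded, one-time cost rather than a
    # recompile on every step.
    for p in reference.parameters():
        p.requires_grad = False
    reference.eval()

    # Compile per transformer layer, as the single device PPO recipe does,
    # so each model keeps its own compiled artifacts.
    for model in (policy, reference):
        for i, layer in enumerate(model.layers):
            model.layers[i] = torch.compile(layer)
```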
We might want to offer some kind of customization around the choice of reference model. The only constraint I can think of here is ensuring that the reference and policy models share a tokenizer; beyond that, users should be able to experiment freely. A rough sketch of that check follows.
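A hedged sketch of what validating that constraint could look like during recipe setup; `validate_reference_policy` and the `reference_tokenizer` config key are illustrative assumptions, not existing torchtune APIs:

```python
from omegaconf import DictConfig

def validate_reference_policy(cfg: DictConfig) -> None:
    # DPO compares per-token log-probs from the policy and reference
    # models over identically tokenized sequences, so both models must
    # use the same tokenizer.
    ref_tok = cfg.get("reference_tokenizer", cfg.tokenizer)
    if ref_tok != cfg.tokenizer:
        raise ValueError(
            "Reference and policy models must share a tokenizer for DPO; "
            f"got {ref_tok} vs {cfg.tokenizer}."
        )
```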