Full-finetune DPO single device recipe
This should be straightforward. The main issue I see coming up is with compile, similar to how we attempt to compile the reference and policy models in our single device PPO recipe. Since the `SelfAttentionLayer` block is inlined and shared across the models, we're going to hit recompiles due to guards on `param.requires_grad` (see the sketch below). This might be acceptable in this case, since the recompiles won't be as severe as with PPO in its current state #2066.
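A minimal sketch of the concern, assuming both models expose a `.layers` `ModuleList` (as torchtune's `TransformerDecoder` does); `setup_compiled_dpo_models` is a hypothetical helper, not recipe code:

```python
import torch
import torch.nn as nn

def setup_compiled_dpo_models(policy: nn.Module, reference: nn.Module) -> None:
    # The reference model is frozen in DPO, so its params differ from the
    # policy's in `requires_grad`. torch.compile guards on that attribute,
    # so a layer class shared across both models is re-traced once per
    # distinct guard set -- a bounded, one-time cost rather than a
    # recompile on every step.
    for p in reference.parameters():
        p.requires_grad = False
    reference.eval()

    # Compile per transformer layer, as the single device PPO recipe does,
    # so each model keeps its own compiled artifacts.
    for model in (policy, reference):
        for i, layer in enumerate(model.layers):
            model.layers[i] = torch.compile(layer)
```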
We might want to offer some kind of customization around the choice of reference model. The only constraint I can think of here is ensuring that the reference and policy models share a tokenizer; beyond that, users should be able to experiment freely. A rough sketch of that check follows.
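A hedged sketch of what validating that constraint could look like during recipe setup; `validate_reference_policy` and the `reference_tokenizer` config key are illustrative assumptions, not existing torchtune APIs:

```python
from omegaconf import DictConfig

def validate_reference_policy(cfg: DictConfig) -> None:
    # DPO compares per-token log-probs from the policy and reference
    # models over identically tokenized sequences, so both models must
    # use the same tokenizer.
    ref_tok = cfg.get("reference_tokenizer", cfg.tokenizer)
    if ref_tok != cfg.tokenizer:
        raise ValueError(
            "Reference and policy models must share a tokenizer for DPO; "
            f"got {ref_tok} vs {cfg.tokenizer}."
        )
```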