trl with seq2seq
Hello, Thanks for releasing this code.
I would like to use this algorithm with a trained seq2seq (x -> y) model. I would initialize both the active model and the reference model from the trained seq2seq, then proceed as follows:
- Roll-out: x -> active model -> sample output y
- Evaluation: get a reward for y
- Optimization: x -> active model, force y as decoder input, get decoder log-probs; x -> ref model, force y as decoder input, get decoder log-probs; then compute the KL penalty, combine it with the reward, etc.
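The optimization step above could be sketched in plain PyTorch roughly as follows. This is only an illustration with dummy tensors standing in for the real decoder logits; `token_logprobs`, `kl_coef`, and the reward values are hypothetical names, and the KL term uses the common per-token estimate `logp_active - logp_ref`:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, vocab = 2, 5, 10

# Stand-ins for decoder logits obtained by teacher-forcing the sampled y
# through the active and reference seq2seq models: (batch, seq_len, vocab).
active_logits = torch.randn(batch, seq_len, vocab)
ref_logits = torch.randn(batch, seq_len, vocab)
y = torch.randint(0, vocab, (batch, seq_len))  # sampled output tokens

def token_logprobs(logits, tokens):
    # log p(token_t | prefix) at each decoder position
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)

active_lp = token_logprobs(active_logits, y)  # (batch, seq_len)
ref_lp = token_logprobs(ref_logits, y)

# Per-token KL penalty estimate between active and reference policies
kl = active_lp - ref_lp

# Combine with a scalar reward per sequence (placeholder values):
# -kl everywhere, plus the task reward added at the final token.
reward = torch.tensor([1.0, -0.5])
kl_coef = 0.2
shaped = -kl_coef * kl
shaped[:, -1] += reward
```

The shaped per-token rewards would then feed into the PPO update for the active model, while the reference model stays frozen.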
Does it make sense to proceed this way?
Thank you for your feedback
I think that makes sense. I have not used a seq2seq model yet, so you might want to start with a decoder-only model (which should work) and then compare the results to your enc-dec approach. Good luck!