
Feature Request: Self-Improving Robust Preference Optimization (SRPO)

Open duyvuleo opened this issue 1 year ago • 1 comments

Hi,

This new paper (https://arxiv.org/pdf/2406.01660v2) looks very compelling.

For offline RLHF, SRPO appears to outperform DPO on out-of-distribution (OOD) tasks.

Is there a plan to implement this in TRL?

I could not find an SRPO implementation on GitHub yet.

Thanks!

duyvuleo avatar Jun 07 '24 02:06 duyvuleo

[image attached]

The main change is in sample construction: the original (x, yl) + (x, yw) pairs become (x + yl, yl) + (x + yl, yw) + (x + yw, yl) + (x + yw, yw), i.e., each target completion is also conditioned on a previous completion. This significantly increases sample length, the number of forward passes, and overall computational cost.

In addition, although the interpretation of the policy stays the same, the actual input distribution changes. At inference time, when we want a better output, we only know x, so this setup is effectively multiple rounds of generation. To cover both the original distribution and the new revision-conditioned distribution, training (or sampling) has to be organized into at least two stages, or even N stages (the method does allow multiple rounds of generation). Addressing this by introducing a reward model seems more concise.
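For concreteness, here is a minimal sketch (not from the paper or TRL) of how one DPO-style preference record could be expanded into the four conditioned pairs described above. The `build_srpo_style_pairs` helper, the `revision_template`, and the field names are all illustrative assumptions, not the paper's exact format.

```python
# Minimal sketch of the sample construction described above, assuming a
# standard DPO-style preference record {"prompt": x, "chosen": y_w, "rejected": y_l}.
# The revision template and field names are illustrative, not the paper's format.

def build_srpo_style_pairs(
    example,
    revision_template="{prompt}\n\nPrevious answer:\n{previous}\n\nImproved answer:\n",
):
    """Expand one (x, y_w, y_l) preference example into the four
    (x + y_prev, y) pairs used for self-revision-style training."""
    x, y_w, y_l = example["prompt"], example["chosen"], example["rejected"]
    pairs = []
    for y_prev in (y_l, y_w):          # condition on either previous completion
        revised_prompt = revision_template.format(prompt=x, previous=y_prev)
        for y in (y_l, y_w):           # target is either completion
            pairs.append(
                {"prompt": revised_prompt, "completion": y, "is_chosen": y is y_w}
            )
    return pairs


example = {
    "prompt": "Explain gradient clipping in one sentence.",
    "chosen": "Gradient clipping rescales gradients whose norm exceeds a threshold to stabilize training.",
    "rejected": "It makes the gradients smaller sometimes.",
}
for p in build_srpo_style_pairs(example):
    print(p["is_chosen"], len(p["prompt"]))
```

Each original record expands into four longer sequences, which is where the extra sample length and forward passes come from.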

Trangle avatar Jun 07 '24 08:06 Trangle

I'm planning to add this to TRL. Hope to have a PR ready relatively soon!

frasermince avatar Jun 12 '24 14:06 frasermince