
Step 2 reward model finetuning: how is the loss computed?

Open · ridiculouz opened this issue 1 year ago · 1 comment

Hi there, I notice that in step 2 the reported scores (i.e. `chosen_mean_scores` and `reject_mean_scores`) match the description:

> ... either the end token of the sequence or the first padding token ...

But in the loss computation, the implementation takes the mean log-sigmoid score over all divergence tokens, not just the last one. Is this intended? In Anthropic's paper "A General Language Assistant as a Laboratory for Alignment", they use only the last token to compute the reward, which seems like the more standard approach.
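
For concreteness, a minimal sketch of the two variants being contrasted; the tensor values and the divergence slicing are illustrative, not taken from the repo:

```python
import torch
import torch.nn.functional as F

# Hypothetical per-token scores for one chosen/rejected pair, already sliced
# to the divergence region (tokens where the two responses differ).
c_truncated_reward = torch.tensor([0.4, 0.9, 1.3])  # chosen
r_truncated_reward = torch.tensor([0.1, 0.5, 0.2])  # rejected

# Variant 1: mean log-sigmoid over all divergence tokens
# (what the step-2 loss appears to do).
loss_all_tokens = -F.logsigmoid(c_truncated_reward - r_truncated_reward).mean()

# Variant 2: log-sigmoid at the last token only
# (the approach described in Anthropic's paper).
loss_last_token = -F.logsigmoid(c_truncated_reward[-1] - r_truncated_reward[-1])

print(loss_all_tokens.item(), loss_last_token.item())
```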

ridiculouz · Aug 17 '23

This `mean()` is batch-wise, not token-wise.
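
In other words, under this reading the pairwise loss is averaged across the examples in the batch, not across the tokens of a single pair. A minimal sketch with made-up values:

```python
import torch
import torch.nn.functional as F

# One reward difference per chosen/rejected pair in a batch of 4 (made-up values).
score_diff = torch.tensor([0.8, -0.2, 1.1, 0.3])

# Batch-wise mean: the pairwise ranking loss is averaged across examples,
# not across the tokens of any single sequence.
loss = -F.logsigmoid(score_diff).mean()
print(loss.item())
```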

paraGONG · Oct 28 '23