Step 2 reward model finetuning: how is the loss computed?
Hi there,
I notice that in step 2, the reported scores (i.e., chosen_mean_scores and reject_mean_scores) match the description: they are taken from "... either the end token of the sequence or the first padding token ...".
But in the loss computation, the implementation takes the mean log_sigmoid score over all divergence tokens, not just the last one. Is this intended? In Anthropic's paper "A General Language Assistant as a Laboratory for Alignment", only the last token is used to compute the reward, which seems like the more standard approach.
This mean() is batch-wise, not token-wise.
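To make the comparison concrete, here is a minimal sketch of the two variants I am contrasting, for a single chosen/rejected pair. The function names, the div_ind/end_ind arguments, and the toy tensors are mine for illustration; they are not the actual reward_model.py code.

import torch
import torch.nn.functional as F

def loss_over_divergence_span(chosen_rewards, rejected_rewards, div_ind, end_ind):
    # Variant I see in the loss computation: average the pairwise
    # log-sigmoid loss over every token from the first diverging
    # position up to the end/pad token.
    c_truncated = chosen_rewards[div_ind:end_ind]
    r_truncated = rejected_rewards[div_ind:end_ind]
    return -F.logsigmoid(c_truncated - r_truncated).mean()

def loss_on_last_token(chosen_rewards, rejected_rewards, c_end, r_end):
    # Variant described in the Anthropic paper: compare only the scalar
    # scores taken at the end token (or first padding token) of each sequence.
    return -F.logsigmoid(chosen_rewards[c_end - 1] - rejected_rewards[r_end - 1])

# Toy example: per-token rewards for one chosen/rejected pair of length 8.
chosen = torch.randn(8)
rejected = torch.randn(8)
print(loss_over_divergence_span(chosen, rejected, div_ind=3, end_ind=8))
print(loss_on_last_token(chosen, rejected, c_end=8, r_end=8))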