DeepSpeedExamples
The loss in reward_model.py
https://github.com/microsoft/DeepSpeedExamples/blob/bae2667824974ac13dac28712462c14a2e594150/applications/DeepSpeed-Chat/training/utils/model/reward_model.py#L103
What if we change the loss to torch.log(torch.sigmoid(c_truncated_reward.mean() - r_truncated_reward.mean())) instead of torch.log(torch.sigmoid(c_truncated_reward - r_truncated_reward)).mean()? I think the InstructGPT paper uses the former, i.e., the mean inside the sigmoid?
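For concreteness, here is a minimal sketch of the two candidate expressions for a single chosen/rejected pair, with the sign flipped so both read as losses. Only the tensor names are taken from the thread; the values are made up, not from reward_model.py.

```python
import torch

# Toy per-token rewards for the divergent span of one chosen/rejected pair
# (hypothetical values; in the repo these would come from the v_head outputs).
c_truncated_reward = torch.tensor([0.8, 0.5, 1.2])
r_truncated_reward = torch.tensor([0.3, 0.7, 0.1])

# Current code: per-token log-sigmoid differences, averaged afterwards.
loss_mean_outside = -torch.log(torch.sigmoid(c_truncated_reward - r_truncated_reward)).mean()

# Proposed change: collapse each sequence to a scalar reward first, then compare once.
loss_mean_inside = -torch.log(torch.sigmoid(c_truncated_reward.mean() - r_truncated_reward.mean()))

print(loss_mean_outside.item(), loss_mean_inside.item())  # not equal in general: log-sigmoid is nonlinear
```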
Hi, do you mean using the average token reward as the score instead of the last token's reward? Here, the mean should be outside the sigmoid function, since it is taken over different sequences (i.e., a batch).
Hi,
Thank you very much for the reply. I mean that in the InstructGPT paper (https://arxiv.org/pdf/2203.02155.pdf, Equation 1, reproduced below), r_theta(x, y) should be a scalar. From my understanding, in your repo c_truncated_reward and r_truncated_reward are vectors containing the rewards of all the tokens in the chosen and rejected sequences. Thus it should be torch.sigmoid(c_truncated_reward.mean() - r_truncated_reward.mean()), i.e., the mean should be inside the sigmoid function.
Not sure if I understand it right. Please correct me if I am wrong.
Best.
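For reference, Equation 1 of the InstructGPT paper is, up to notation, the pairwise loss below, where r_theta(x, y) is a scalar reward for prompt x and completion y, and y_w is preferred over y_l:

```latex
\operatorname{loss}(\theta) = -\frac{1}{\binom{K}{2}}\,
\mathbb{E}_{(x, y_w, y_l) \sim D}
\left[ \log \left( \sigma \left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right) \right]
```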
In our case, it is also a scalar. The vector is along the batch dimension, not the seq-length dimension.
@yaozhewei Your explanation here seems to be wrong. There is already a for loop that iterates over the batch dimension. https://github.com/microsoft/DeepSpeedExamples/blob/bae2667824974ac13dac28712462c14a2e594150/applications/DeepSpeed-Chat/training/utils/model/reward_model.py#L72-L103
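To make the dimension question concrete, here is a simplified, hypothetical sketch of the loop structure being discussed: a per-example loop over the batch, with per-token reward vectors inside it. Names mirror the thread, but the details are assumptions rather than a verbatim copy of reward_model.py.

```python
import torch

def pairwise_loss(rewards, input_ids, bs):
    # rewards, input_ids: [2*bs, seq_len]; chosen examples first, rejected second.
    chosen_rewards, rejected_rewards = rewards[:bs], rewards[bs:]
    chosen_ids, rejected_ids = input_ids[:bs], input_ids[bs:]
    loss = torch.tensor(0.0)
    for i in range(bs):  # the for loop runs over the batch dimension
        chosen_reward = chosen_rewards[i]      # [seq_len] per-token values
        rejected_reward = rejected_rewards[i]  # [seq_len] per-token values
        # Keep only the span where the two sequences diverge, so the truncated
        # rewards are still indexed by token position (seq-length dimension).
        diff = (chosen_ids[i] != rejected_ids[i]).nonzero()
        start = diff[0].item() if len(diff) > 0 else 0
        c_truncated_reward = chosen_reward[start:]
        r_truncated_reward = rejected_reward[start:]
        loss = loss - torch.log(torch.sigmoid(c_truncated_reward - r_truncated_reward)).mean()
    return loss / bs

# Example: batch of 2 pairs, seq_len 8
print(pairwise_loss(torch.randn(4, 8), torch.randint(0, 5, (4, 8)), bs=2))
```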
Thanks for the reply. I still have some confusion. I printed the dimensions of "chosen_reward" and "rejected_reward", which are 512 (the sequence length set in the args).
"c_truncated_reward" and "r_truncated_reward" are vectors truncated from "chosen_reward" and "rejected_reward", with a length of, for example, 76. Thus I think the vectors lie along the seq-length dimension rather than the batch dimension.
Best.
Hi @haochenglouis, I think you are right, and your suggestion is a viable solution.
However, I think it may be better to modify self.v_head
to map both the seq-length dimension and the embedding dimension into a scalar, not only the embedding dimension (a sketch follows below).
https://github.com/microsoft/DeepSpeedExamples/blob/bae2667824974ac13dac28712462c14a2e594150/applications/DeepSpeed-Chat/training/utils/model/reward_model.py#L56
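One possible way to realize that suggestion is sketched below, under assumed names and shapes: instead of the current linear head that only collapses the embedding dimension into a per-token value, also pool over the seq-length dimension so the head returns one scalar per sequence. This is only an illustration, not the repository's implementation; a flattening nn.Linear over seq_len * hidden_size would be another option.

```python
import torch
import torch.nn as nn

class SequenceValueHead(nn.Module):
    """Hypothetical head: map [seq_len, hidden_size] to one scalar reward per sequence."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # Embedding dimension -> per-token value (same role as the original v_head).
        self.token_head = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: [batch, seq_len, hidden_size]; attention_mask: [batch, seq_len]
        token_values = self.token_head(hidden_states).squeeze(-1)  # [batch, seq_len]
        mask = attention_mask.float()
        # Masked mean over the seq-length dimension -> one scalar reward per sequence.
        return (token_values * mask).sum(dim=-1) / mask.sum(dim=-1).clamp(min=1)

# Example usage with dummy tensors:
head = SequenceValueHead(hidden_size=768)
h = torch.randn(2, 512, 768)
m = torch.ones(2, 512)
print(head(h, m).shape)  # torch.Size([2])
```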
Yes. I agree with you. Thanks for the reply!
Closed due to no follow-up for 2 weeks.