DeepSpeedExamples
The loss in reward_model.py
https://github.com/microsoft/DeepSpeedExamples/blob/bae2667824974ac13dac28712462c14a2e594150/applications/DeepSpeed-Chat/training/utils/model/reward_model.py#L103
What if we change the loss to torch.log(torch.sigmoid(c_truncated_reward.mean() - r_truncated_reward.mean())) instead of torch.log(torch.sigmoid(c_truncated_reward - r_truncated_reward)).mean()? I think the InstructGPT paper uses the former, i.e., the mean inside the sigmoid?
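For concreteness, here is a minimal sketch of the two candidate expressions for a single chosen/rejected pair, with the sign flipped so both read as losses. Only the tensor names are taken from the thread; the values are made up, not from reward_model.py.

```python
import torch

# Toy per-token rewards for the divergent span of one chosen/rejected pair
# (hypothetical values; in the repo these would come from the v_head outputs).
c_truncated_reward = torch.tensor([0.8, 0.5, 1.2])
r_truncated_reward = torch.tensor([0.3, 0.7, 0.1])

# Current code: per-token log-sigmoid differences, averaged afterwards.
loss_mean_outside = -torch.log(torch.sigmoid(c_truncated_reward - r_truncated_reward)).mean()

# Proposed change: collapse each sequence to a scalar reward first, then compare once.
loss_mean_inside = -torch.log(torch.sigmoid(c_truncated_reward.mean() - r_truncated_reward.mean()))

print(loss_mean_outside.item(), loss_mean_inside.item())  # not equal in general: log-sigmoid is nonlinear
```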
Hi, do you mean using the average token reward as the score instead of the last token's reward? Here, the mean should be outside the sigmoid function, since it is taken over different sequences (i.e., a batch).
Hi,
Thank you very much for the reply. I mean that in the InstructGPT paper (https://arxiv.org/pdf/2203.02155.pdf, Equation 1, reproduced below), r_theta(x, y) should be a scalar. From my understanding, in your repo c_truncated_reward and r_truncated_reward are vectors containing the rewards of all the tokens in the chosen and rejected sequences. Thus it should be torch.sigmoid(c_truncated_reward.mean() - r_truncated_reward.mean()), i.e., the mean should be inside the sigmoid function.
Not sure if I understand it right. Please correct me if I am wrong.
Best.
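For reference, Equation 1 of the InstructGPT paper is, up to notation, the pairwise loss below, where r_theta(x, y) is a scalar reward for prompt x and completion y, and y_w is preferred over y_l:

```latex
\operatorname{loss}(\theta) = -\frac{1}{\binom{K}{2}}\,
\mathbb{E}_{(x, y_w, y_l) \sim D}
\left[ \log \left( \sigma \left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right) \right]
```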
In our case, it is also a scalar. The vector is along the batch dimension, not the seq-length dimension.
@yaozhewei Your explanation here seems to be wrong. There is already a for loop that iterates over the batch dimension. https://github.com/microsoft/DeepSpeedExamples/blob/bae2667824974ac13dac28712462c14a2e594150/applications/DeepSpeed-Chat/training/utils/model/reward_model.py#L72-L103
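To make the dimension question concrete, here is a simplified, hypothetical sketch of the loop structure being discussed: a per-example loop over the batch, with per-token reward vectors inside it. Names mirror the thread, but the details are assumptions rather than a verbatim copy of reward_model.py.

```python
import torch

def pairwise_loss(rewards, input_ids, bs):
    # rewards, input_ids: [2*bs, seq_len]; chosen examples first, rejected second.
    chosen_rewards, rejected_rewards = rewards[:bs], rewards[bs:]
    chosen_ids, rejected_ids = input_ids[:bs], input_ids[bs:]
    loss = torch.tensor(0.0)
    for i in range(bs):  # the for loop runs over the batch dimension
        chosen_reward = chosen_rewards[i]      # [seq_len] per-token values
        rejected_reward = rejected_rewards[i]  # [seq_len] per-token values
        # Keep only the span where the two sequences diverge, so the truncated
        # rewards are still indexed by token position (seq-length dimension).
        diff = (chosen_ids[i] != rejected_ids[i]).nonzero()
        start = diff[0].item() if len(diff) > 0 else 0
        c_truncated_reward = chosen_reward[start:]
        r_truncated_reward = rejected_reward[start:]
        loss = loss - torch.log(torch.sigmoid(c_truncated_reward - r_truncated_reward)).mean()
    return loss / bs

# Example: batch of 2 pairs, seq_len 8
print(pairwise_loss(torch.randn(4, 8), torch.randint(0, 5, (4, 8)), bs=2))
```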
Thanks for the reply. I still have some confusion. I printed the dimensions of "chosen_reward" and "rejected_reward", which are 512 (the sequence length set in the args).
"c_truncated_reward" and "r_truncated_reward" are vectors truncated from "chosen_reward" and "rejected_reward", with a length of, for example, 76. Thus I think the vectors lie along the seq-length dimension rather than the batch dimension.
Best.
Hi @haochenglouis, I think you are right, and your suggestion is a viable solution.
However, I think it may be better to modify self.v_head
to map both the seq-length dimension and the embedding dimension into a scalar, not only the embedding dimension (a sketch follows below).
https://github.com/microsoft/DeepSpeedExamples/blob/bae2667824974ac13dac28712462c14a2e594150/applications/DeepSpeed-Chat/training/utils/model/reward_model.py#L56
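One possible way to realize that suggestion is sketched below, under assumed names and shapes: instead of the current linear head that only collapses the embedding dimension into a per-token value, also pool over the seq-length dimension so the head returns one scalar per sequence. This is only an illustration, not the repository's implementation; a flattening nn.Linear over seq_len * hidden_size would be another option.

```python
import torch
import torch.nn as nn

class SequenceValueHead(nn.Module):
    """Hypothetical head: map [seq_len, hidden_size] to one scalar reward per sequence."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # Embedding dimension -> per-token value (same role as the original v_head).
        self.token_head = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: [batch, seq_len, hidden_size]; attention_mask: [batch, seq_len]
        token_values = self.token_head(hidden_states).squeeze(-1)  # [batch, seq_len]
        mask = attention_mask.float()
        # Masked mean over the seq-length dimension -> one scalar reward per sequence.
        return (token_values * mask).sum(dim=-1) / mask.sum(dim=-1).clamp(min=1)

# Example usage with dummy tensors:
head = SequenceValueHead(hidden_size=768)
h = torch.randn(2, 512, 768)
m = torch.ones(2, 512)
print(head(h, m).shape)  # torch.Size([2])
```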
Yes. I agree with you. Thanks for the reply!
Closed due to no follow-up for 2 weeks.