
How to understand the code for calculating rewards

lyzKF opened this issue 2 years ago · 1 comment

https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py#L148

  def compute_rewards(self, prompts, log_probs, ref_log_probs, reward_score,
                      action_mask):
      # Per-token reward: the (negated, scaled) KL estimate between the actor
      # and the frozen reference model.
      kl_divergence_estimate = -self.kl_ctl * (log_probs - ref_log_probs)
      rewards = kl_divergence_estimate
      # Index of the first generated token and, per sample, one past the last.
      start = prompts.shape[1] - 1
      ends = start + action_mask[:, start:].sum(1) + 1
      # Clamp the scalar reward-model score before adding it.
      reward_clip = torch.clamp(reward_score, -self.clip_reward_value,
                                self.clip_reward_value)
      batch_size = log_probs.shape[0]
      for j in range(batch_size):
          # Add the clipped score only at the last generated token.
          rewards[j, start:ends[j]][-1] += reward_clip[j]

      return rewards

Why should we calculate rewards like this? Is there a theory to back it up?

lyzKF · Aug 22 '23 08:08

Hi! @lyzKF

I had the same confusion about this part. After some research, I found that this seems to be common practice when applying PPO to RLHF for LLMs.

From my understanding, the kl_divergence_estimate serves as a regularization-like penalty term: it folds the per-token divergence between the new (actor) model and the original (reference) model into the reward, so the policy is discouraged from drifting too far from the model it started from. Its purpose is somewhat similar to the clip operation in the PPO loss.
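To make the mechanics concrete, here is a minimal runnable sketch of the same logic with toy tensors (the kl_ctl and clip_reward_value values, shapes, and numbers are all made up for illustration, not taken from the repo): every generated token receives -kl_ctl * (log_probs - ref_log_probs) as its reward, and the clipped scalar score from the reward model is added only at the last generated token.

    import torch

    kl_ctl = 0.1             # assumed KL penalty coefficient
    clip_reward_value = 5.0  # assumed clipping bound for the reward-model score

    prompt_len, seq_len, batch_size = 3, 8, 2
    log_probs = torch.randn(batch_size, seq_len - 1)      # actor log-probs per generated token
    ref_log_probs = torch.randn(batch_size, seq_len - 1)  # frozen reference (SFT) model log-probs
    reward_score = torch.tensor([7.2, -0.4])              # one scalar per sequence from the reward model
    action_mask = torch.ones(batch_size, seq_len - 1, dtype=torch.long)

    # Per-token reward is just the (negated, scaled) KL estimate.
    rewards = -kl_ctl * (log_probs - ref_log_probs)

    # The clipped reward-model score is added only at the last generated token.
    start = prompt_len - 1
    ends = start + action_mask[:, start:].sum(1) + 1
    reward_clip = torch.clamp(reward_score, -clip_reward_value, clip_reward_value)
    for j in range(batch_size):
        rewards[j, start:ends[j]][-1] += reward_clip[j]

    print(rewards)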

This practice of adding a KL penalty term to the reward has been adopted in several OpenAI papers, for example: Fine-Tuning Language Models from Human Preferences

Their code implementation adopts a similar strategy.
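If I read that paper correctly, the reward they optimize is roughly

    R(x, y) = r(x, y) - β · log( π(y | x) / ρ(y | x) )

where r is the learned reward model, π is the policy being trained, ρ is the original (reference) language model, and β is the KL coefficient. The DeepSpeed-Chat code above is the same idea, except that the KL term is spread out per token and the (clipped) r(x, y) is added only at the final token.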

As for reward_clip, I believe it can also be understood as a trick for stabilizing training: clamping the reward model's scalar score keeps occasional outlier scores from dominating the PPO update.
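As a tiny illustration with made-up numbers (assuming clip_reward_value = 5.0), an outlier score from the reward model gets bounded before it enters the reward:

    import torch

    reward_score = torch.tensor([7.2, -0.4, -11.8])   # hypothetical reward-model scores
    reward_clip = torch.clamp(reward_score, -5.0, 5.0)
    print(reward_clip)  # tensor([ 5.0000, -0.4000, -5.0000])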

This is my understanding, though I'm not entirely certain of its accuracy. If anyone has deeper insights, I'd welcome your guidance!

benjpau · Aug 28 '25 09:08