DeepSpeedExamples
How to understand the code for calculating rewards
https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py#L148
def compute_rewards(self, prompts, log_probs, ref_log_probs, reward_score,
                    action_mask):
    # Per-token KL penalty between the actor and the frozen reference model.
    kl_divergence_estimate = -self.kl_ctl * (log_probs - ref_log_probs)
    rewards = kl_divergence_estimate
    # Answer tokens start right after the prompt; ends[j] marks one past the
    # last non-padding answer position of sample j.
    start = prompts.shape[1] - 1
    ends = start + action_mask[:, start:].sum(1) + 1
    # Clip the scalar reward-model score for stability.
    reward_clip = torch.clamp(reward_score, -self.clip_reward_value,
                              self.clip_reward_value)
    batch_size = log_probs.shape[0]
    # Add the (clipped) sequence-level score only at the end of each answer.
    for j in range(batch_size):
        rewards[j, start:ends[j]][-1] += reward_clip[j]
    return rewards
Why should we calculate rewards like this? Is there a theory to back it up?
Hi @lyzKF!
I had the same confusion about this part. After some research, I found that this seems to be common practice when applying PPO to reinforcement learning for LLMs (RLHF).
From my understanding, kl_divergence_estimate acts as a regularization-style penalty: at every token, the divergence between the fine-tuned (actor) model and the original (reference) model is subtracted from the reward, which discourages the policy from drifting too far from where it started. Its purpose is somewhat similar to the clip operation in the PPO loss, which also limits how much the policy can change per update. The scalar score from the reward model is then added only at the last answer token, because the reward model judges the completed response as a whole.
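To make that concrete, here is a minimal, self-contained sketch of the idea. This is not the DeepSpeed-Chat code itself: kl_ctl, clip_reward_value, the tensor sizes, and the simplified last-token indexing are all assumptions for illustration, and the real compute_rewards does more careful index bookkeeping via prompts.shape[1] and action_mask.

import torch

# Minimal toy setup (values and sizes are made up for illustration):
kl_ctl = 0.1               # KL penalty coefficient (beta)
clip_reward_value = 5.0    # clipping range for the reward-model score

log_probs = torch.randn(2, 7)       # actor log-probs, one per token position
ref_log_probs = torch.randn(2, 7)   # reference (SFT) model log-probs, same shape
reward_score = torch.tensor([6.2, -1.3])             # one scalar per sequence from the reward model
action_mask = torch.tensor([[1, 1, 1, 1, 1, 1, 0],   # 1 = real token, 0 = padding
                            [1, 1, 1, 1, 1, 0, 0]])

# 1) Every token position gets a penalty proportional to how far the actor's
#    log-prob has drifted from the reference model's log-prob.
rewards = -kl_ctl * (log_probs - ref_log_probs)

# 2) The reward model scores the whole response with one scalar; it is clipped
#    and added only at the last real (non-padding) token of each sequence.
last_token = action_mask.sum(1) - 1
clipped = torch.clamp(reward_score, -clip_reward_value, clip_reward_value)
for j in range(rewards.shape[0]):
    rewards[j, last_token[j]] += clipped[j]

print(rewards)   # per-token rewards fed into the PPO advantage computation

Downstream, PPO typically uses only the answer portion of this per-token reward tensor when computing advantages, so the KL penalty shapes every generated token while the human-preference signal arrives once, at the end of the sequence.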
This practice of adding a KL penalty term to the reward has been adopted in several OpenAI papers, for example Fine-Tuning Language Models from Human Preferences (Ziegler et al., 2019), and their released code takes a similar approach.
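If I recall that paper correctly, the reward it optimizes has the form

R(x, y) = r(x, y) − β · log( π(y|x) / ρ(y|x) )

where r is the learned reward model, π is the policy being fine-tuned, ρ is the original (reference) language model, and β plays the same role as self.kl_ctl here. compute_rewards implements the same idea, with the KL term spread over individual tokens and the reward-model score added at the final token.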
As for reward_clip, I believe it can be understood as a trick for stabilizing training: clamping the raw reward-model score keeps occasional extreme scores from dominating the advantage estimates.
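As a tiny illustration of what that clamp does (the numbers and the clip range below are made up):

import torch

clip_reward_value = 5.0                             # assumed clip range
reward_score = torch.tensor([9.7, -0.4, -12.3])     # made-up raw reward-model scores

# Outliers are squashed to the boundary so a single extreme score
# cannot dominate the PPO update.
reward_clip = torch.clamp(reward_score, -clip_reward_value, clip_reward_value)
print(reward_clip)   # tensor([ 5.0000, -0.4000, -5.0000])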
This is my understanding, though I'm not entirely certain of its accuracy. If anyone has deeper insights, I'd welcome your guidance!