Incorrect calculation of generalized advantage estimates in PPO

Open zhezherun opened this issue 1 year ago • 0 comments

The following code in PPOAgent.compute_advantages ignores value predictions for final observations in the trajectory and instead passes one-before-last values to the generalized_advantage_estimation function twice:

    # Arg value_preds was appended with final next_step value. Make tensors
    #   next_value_preds by stripping first and last elements respectively.
    value_preds = value_preds[:, :-1]
    if self._use_gae:
      advantages = value_ops.generalized_advantage_estimation(
          values=value_preds,
          final_value=value_preds[:, -1],
          rewards=rewards,
          discounts=discounts,
          td_lambda=self._lambda,
          time_major=False,
      )

Instead, final_value should be extracted before value_preds are stripped, e.g.:

    final_value_preds = value_preds[:, -1]
    value_preds = value_preds[:, :-1]
    if self._use_gae:
      advantages = value_ops.generalized_advantage_estimation(
          values=value_preds,
          final_value=final_value_preds,
          rewards=rewards,
          discounts=discounts,
          td_lambda=self._lambda,
          time_major=False,
      )

Also, the comment about next_value_preds doesn't match the code so it could be improved.

Feb 12 '25 13:02 zhezherun