Critic Q-loss in PPO agent seems to be wrong
In line 282 of `ppo_agent.py`, the critic is trained with:

value_loss = self._config.value_loss_coeff * (ret - value_pred).pow(2).mean()

where `ret` is computed as `ret = adv + vpred[:-1]`.
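For context, here is a minimal, self-contained sketch of the computation I am asking about, assuming GAE-style advantages. The helper name `compute_gae`, the dummy rollout data, and the coefficient value are purely illustrative and not taken from the repo:

```python
import torch

def compute_gae(rewards, vpred, dones, gamma=0.99, lam=0.95):
    """Sketch of GAE advantage estimation; assumes vpred has T + 1 entries
    (one value per rollout state plus a bootstrap value for the last state)."""
    T = rewards.shape[0]
    adv = torch.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * vpred[t + 1] * nonterminal - vpred[t]
        last_gae = delta + gamma * lam * nonterminal * last_gae
        adv[t] = last_gae
    return adv

# Dummy rollout data, purely illustrative.
T = 5
rewards = torch.rand(T)
dones = torch.zeros(T)
vpred = torch.rand(T + 1)   # critic predictions, including the bootstrap value

adv = compute_gae(rewards, vpred, dones)
ret = adv + vpred[:-1]      # the return target in question

value_pred = vpred[:-1]     # critic output for the visited states
value_loss_coeff = 0.5      # illustrative coefficient
value_loss = value_loss_coeff * (ret - value_pred).pow(2).mean()
print(value_loss)
```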
This way of calculating the return seems to give a Q-loss, but the critic actually predicts V, as here. In other words, the critic appears to be trained with a Q-loss while being used to predict only state values. Could you clarify this?
Thanks