Critic Q-loss in PPO agent seems to be wrong
In line 282 of `ppo_agent.py`, the critic is trained with:

value_loss = self._config.value_loss_coeff * (ret - value_pred).pow(2).mean()

where `ret` is computed as `ret = adv + vpred[:-1]`.
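For context, here is a minimal, self-contained sketch of the computation I am asking about, assuming GAE-style advantages. The helper name `compute_gae`, the dummy rollout data, and the coefficient value are purely illustrative and not taken from the repo:

```python
import torch

def compute_gae(rewards, vpred, dones, gamma=0.99, lam=0.95):
    """Sketch of GAE advantage estimation; assumes vpred has T + 1 entries
    (one value per rollout state plus a bootstrap value for the last state)."""
    T = rewards.shape[0]
    adv = torch.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * vpred[t + 1] * nonterminal - vpred[t]
        last_gae = delta + gamma * lam * nonterminal * last_gae
        adv[t] = last_gae
    return adv

# Dummy rollout data, purely illustrative.
T = 5
rewards = torch.rand(T)
dones = torch.zeros(T)
vpred = torch.rand(T + 1)   # critic predictions, including the bootstrap value

adv = compute_gae(rewards, vpred, dones)
ret = adv + vpred[:-1]      # the return target in question

value_pred = vpred[:-1]     # critic output for the visited states
value_loss_coeff = 0.5      # illustrative coefficient
value_loss = value_loss_coeff * (ret - value_pred).pow(2).mean()
print(value_loss)
```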
This way of calculating the return seems to give a Q-loss, but the critic actually predicts V, as here. In other words, the critic appears to be trained with a Q-loss while being used to predict only state values. Could you clarify this?
Thanks