Proximal-Policy-Optimization-Pytorch
PPO value function clip
Hi, why do you use a maximum instead of a minimum when clipping the value function loss? Suppose clipping occurs, i.e. v_pred_old < v_clipped < v_pred < R (or the reverse ordering): the clipped loss is then larger than the unclipped one. Why does taking the larger loss help reduce variability?
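For reference, the construction the question describes can be sketched in plain Python on scalars (the function name and scalar framing are illustrative; an actual PyTorch implementation would use `torch.clamp` / `torch.max` on tensors). The `max` makes the loss a pessimistic upper bound: whenever the clipped and unclipped predictions disagree, the larger of the two squared errors is optimized, which is the value-function analogue of the `min` in PPO's clipped policy surrogate.

```python
def clipped_value_loss(v_pred, v_pred_old, ret, clip_eps=0.2):
    """Per-sample clipped value loss (scalar sketch, illustrative names)."""
    # Clip the new value prediction to stay within clip_eps of the old one.
    v_clipped = v_pred_old + max(-clip_eps, min(clip_eps, v_pred - v_pred_old))
    loss_unclipped = (v_pred - ret) ** 2
    loss_clipped = (v_clipped - ret) ** 2
    # Taking the maximum yields the pessimistic (larger) loss, limiting how
    # far a single update can move the value estimate from v_pred_old.
    return max(loss_unclipped, loss_clipped)
```

In the ordering from the question (e.g. `v_pred_old = 0.0`, `v_pred = 0.5`, `ret = 1.0`), the clipped prediction `v_clipped = 0.2` gives the larger squared error, so that is the loss used; its gradient with respect to `v_pred` is zero, which discourages updates that move the value estimate more than `clip_eps` away from the old one in a single step.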