pytorch-vsumm-reinforce
Mean instead of sum when computing the `expected_reward` by episode
Hi,
According to most PyTorch implementations of the REINFORCE algorithm, the policy gradient loss should sum the log_probs over the trajectory (over t=1...T) rather than average them. In the paper, this is correctly a sum in equations 8/9/10; the only mean is over the N episodes. I believe the mistake is in the code only.
https://github.com/KaiyangZhou/pytorch-vsumm-reinforce/blob/fdd03be93f090278424af789c120531e49aefa40/main.py#L131
Should be:

```python
expected_reward = log_probs.sum() * (reward - baselines[key])
```
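For context, here is a minimal self-contained sketch of how the fixed line would sit in the episode loop. The variable names mirror main.py, but `compute_reward`, the baseline, and the tensor shapes are simplified stand-ins for illustration, not the repo's actual code:

```python
import torch
from torch.distributions import Bernoulli

# Hypothetical stand-ins for the repo's variables (illustration only):
T, num_episodes = 30, 5
probs = torch.rand(1, T, requires_grad=True)  # frame-selection probs from the policy net
baseline = 0.5                                # plays the role of baselines[key]

def compute_reward(actions):
    # placeholder for the repo's diversity/representativeness reward
    return actions.mean()

cost = 0.0
m = Bernoulli(probs)
for _ in range(num_episodes):
    actions = m.sample()              # one trajectory of frame picks
    log_probs = m.log_prob(actions)   # (1, T) per-step log-probabilities
    reward = compute_reward(actions)  # scalar episode return
    # sum over t = 1..T within the episode (the proposed fix) ...
    expected_reward = log_probs.sum() * (reward - baseline)
    cost -= expected_reward           # minimize negative expected reward
# ... and average only over the N episodes
cost = cost / num_episodes
cost.backward()
```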
My assumption is that the authors averaged instead of summing because videos have different lengths, so the mean keeps the loss magnitude independent of T.
Please tell me if I am wrong. Thanks!
To set a bit of context, REINFORCE implementations usually compute a loss L so that, once differentiated with autograd, its gradient matches the theoretical policy gradient of J(theta).
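Concretely, for a single episode with return R and baseline b, the standard surrogate is (textbook REINFORCE, not copied from the paper):

```latex
L(\theta) = -\left(\sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t)\right)(R - b),
\qquad
\nabla_\theta L(\theta) = -(R - b)\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t),
```

so the autograd gradient of L is exactly the (negated) single-episode REINFORCE estimate of the gradient of J(theta). With `log_probs.mean()` instead, the gradient picks up an extra 1/T factor that varies with video length.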