
Mean instead of sum when computing the `expected_reward` by episode

Open sylvainma opened this issue 4 years ago • 1 comment

Hi, according to most PyTorch implementations of the REINFORCE algorithm, the policy gradient loss should sum the log_probs over the trajectory (sum over t = 1...T) rather than take their mean. In the paper, this is correctly summed in Equations 8/9/10; the only mean is over the N episodes. I believe the mistake is in the code only.

https://github.com/KaiyangZhou/pytorch-vsumm-reinforce/blob/fdd03be93f090278424af789c120531e49aefa40/main.py#L131

Should be

expected_reward = log_probs.sum() * (reward - baselines[key]) 

My guess is that the authors averaged instead of summing to normalize for videos of different lengths, but that changes the objective; a sketch of the corrected loss follows below.
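To make the proposed fix concrete, here is a minimal sketch of the corrected per-episode loss, assuming the usual REINFORCE-with-baseline setup. The names (reinforce_loss, episode_log_probs, ...) are illustrative, not taken from main.py:

import torch

# Sketch of the REINFORCE loss with a baseline (names are illustrative).
# For each of the N sampled episodes, SUM the per-step log-probabilities
# over t = 1..T, then average only over the N episodes.
def reinforce_loss(episode_log_probs, episode_rewards, baselines):
    # episode_log_probs: list of N tensors, each of shape (T_n,)
    # episode_rewards:   list of N floats (return of each episode)
    # baselines:         list of N floats (e.g. moving-average baselines)
    losses = []
    for log_probs, reward, baseline in zip(episode_log_probs, episode_rewards, baselines):
        # Sum over the trajectory (t = 1..T), as in Eqs. 8-10 of the paper.
        expected_reward = log_probs.sum() * (reward - baseline)
        # Gradient ascent on J(theta) == gradient descent on -J(theta).
        losses.append(-expected_reward)
    # The only mean is over the N episodes.
    return torch.stack(losses).mean()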

Please tell me if I am wrong. Thanks!

sylvainma · Apr 23 '20 21:04

To give a bit of context: REINFORCE implementations usually compute a surrogate loss L(theta) such that, once differentiated with autograd, its gradient matches the theoretical policy gradient of J(theta).

[Image: derivation of the surrogate loss L(theta) whose autograd gradient matches the policy gradient of J(theta)]
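For reference, the standard form of that surrogate (my paraphrase in the notation of Eqs. 8-10, with R_n the return of episode n and b a baseline, rather than the exact content of the lost image) is:

L(\theta) = -\frac{1}{N}\sum_{n=1}^{N}\Big(\sum_{t=1}^{T_n}\log \pi_\theta(a_t \mid s_t)\Big)\,(R_n - b)

so that

\nabla_\theta L(\theta) = -\frac{1}{N}\sum_{n=1}^{N}\Big(\sum_{t=1}^{T_n}\nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)\,(R_n - b) \approx -\nabla_\theta J(\theta),

where the inner sum runs over the trajectory and the outer mean runs over the N episodes.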

sylvainma · Apr 23 '20 22:04