Confused about PPO update
I'm a bit confused about the PPO update process. In line 110:
The rewards within a single episode are normalized by subtracting the mean and dividing by the standard deviation. Why should the rewards be scaled at all? It seems that after normalization, some genuinely bad rewards are rescaled and the important information they carry is lost.
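For reference, here is a minimal sketch of the kind of per-episode normalization I mean (the function name and NumPy usage are my own illustration, not the code from line 110):

```python
import numpy as np

def normalize_rewards(rewards, eps=1e-8):
    """Standardize an episode's rewards to zero mean and unit scale.

    Note how an absolutely bad reward (e.g. -10) becomes merely a
    *relative* value after this step, which is what my question is about.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

episode_rewards = [1.0, -10.0, 2.0, 3.0]
print(normalize_rewards(episode_rewards))
```

After this transform the magnitude of the original -10 reward is no longer distinguishable from a mildly below-average reward in another episode, since each episode is rescaled independently.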