
Understanding normalization of advantage function

Open mbcel opened this issue 7 years ago • 7 comments

I am wondering about the normalization of the advantage function in PPO. Before training on a batch, the mean of the advantages is subtracted and the result is divided by their standard deviation.

To me it makes intuitive sense to divide the advantages by their std, since then the gradients have roughly the same magnitude in every update. However, I don't understand why it would be beneficial to subtract the mean. In my understanding that would introduce some form of bias: an advantage that was greater than 0, whose action should therefore be encouraged, can now fall below 0 if it is smaller than the batch mean, so that action is wrongly trained to occur less often.

So what's the intuition behind subtracting the mean? And does it really improve learning?
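
For concreteness, here is a minimal sketch of the per-batch normalization being asked about (NumPy, illustrative names; the exact baselines code may differ):

```python
import numpy as np

def normalize_advantages(adv, eps=1e-8):
    # Per-minibatch normalization as commonly done in PPO implementations:
    # subtract the batch mean, divide by the batch standard deviation.
    return (adv - adv.mean()) / (adv.std() + eps)

# A positive advantage can become negative after subtracting the batch mean,
# which is exactly the behaviour questioned above.
adv = np.array([0.1, 2.0, 3.0, -0.5])
print(normalize_advantages(adv))  # the 0.1 entry is now negative
```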

mbcel avatar Aug 27 '18 16:08 mbcel

If you sample enough data, it's all OK. But in practice we can't, so it's important to subtract it.

initial-h avatar Aug 30 '18 13:08 initial-h

That's still not clear to me.

Independent of the dataset size or batch size used, the advantage function should give me positive advantages for actions I should do more often and negative advantages for actions I should do less often (that's why we subtract the baseline/value function), or am I wrong? Furthermore, subtracting the mean changes this behaviour in my view.

mbcel avatar Sep 01 '18 08:09 mbcel

@marcel1991 I was also wondering about this. Did you gain any insight into the benefits of normalising / not normalising in PPO?

lancerane avatar Nov 16 '18 18:11 lancerane

I still don't understand what the advantage of normalizing the advantages is.

shtse8 avatar Aug 27 '20 02:08 shtse8

https://arxiv.org/pdf/2006.05990.pdf concludes that "per-minibatch advantage normalization (C67) seems not to affect the performance too much (Fig. 35)"

ChenDRAG avatar Mar 10 '21 04:03 ChenDRAG

Doesn't subtracting the mean from the advantages have the effect of an entropy regularizer?

Ignoring the clipping, the objective is logp * (adv - mean) / std = logp * adv / std - logp * mean / std. The first term is the normalized policy gradient. The second term makes all actions in the batch less likely, which effectively acts like policy entropy regularization.

If the batch size is small so the mean and std fluctuate, then the entropy would be increased a bit more on some batches than on others, but it probably doesn't make a big difference.
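
A small numerical check of that decomposition (an illustrative NumPy sketch with made-up values, not the baselines code):

```python
import numpy as np

# Hypothetical batch of log-probabilities and raw advantages.
logp = np.array([-0.5, -1.2, -0.8, -2.0])
adv = np.array([1.0, -0.5, 2.0, 0.3])
mean, std = adv.mean(), adv.std()

# Surrogate objective with normalized advantages (clipping ignored).
normalized_obj = np.mean(logp * (adv - mean) / std)

# The same objective split into the two terms from the comment above:
# a rescaled policy-gradient term minus a term proportional to mean(logp),
# i.e. an entropy-like term whose sign depends on the sign of the batch mean.
pg_term = np.mean(logp * adv) / std
entropy_like_term = mean * np.mean(logp) / std
assert np.isclose(normalized_obj, pg_term - entropy_like_term)
```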

danijar avatar Jan 15 '22 01:01 danijar

@danijar That's actually an interesting perspective. But the mean can also be negative, right? If that's the case, the second term actually makes all actions more likely. So it's a bit unclear a priori what the second term is gonna do.

zhihanyang2022 avatar Jan 15 '22 04:01 zhihanyang2022