awr
Why normalize the value function (vf)?
Hello,
thanks for the code. While re-implementing the program, I found that there is a step that normalizes the value function vf here. It is implemented as v_predict = v(s; \theta) * (1 - \gamma), and the critic update is implemented as

min_\theta [ v(s; \theta) * (1 - \gamma) - v_estimate ]^2.
Is there any reason to normalize the value function's output? I tried removing the normalization term and rescaling the learning rate (by 1 - \gamma), and there seems to be no problem on HalfCheetah-v2: it achieves performance similar to the original version.
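Concretely, the two critic updates I compared look roughly like the sketch below (PyTorch is used only for illustration; value_net, states, and the target names are placeholders, not this repo's actual identifiers):

```python
# A minimal sketch of the two critic losses being compared; not the
# actual code from this repo.
import torch

gamma = 0.99

def critic_loss_normalized(value_net, states, v_target_normalized):
    # Original form: the prediction is scaled by (1 - gamma) and regressed
    # against a target on the same normalized scale.
    v_predict = value_net(states).squeeze(-1) * (1.0 - gamma)
    return ((v_predict - v_target_normalized) ** 2).mean()

def critic_loss_unnormalized(value_net, states, v_target_raw):
    # My variant: drop the (1 - gamma) factor and regress against the raw
    # return estimate, compensating with a rescaled learning rate.
    v_predict = value_net(states).squeeze(-1)
    return ((v_predict - v_target_raw) ** 2).mean()
```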
Best,
The value scaling is mainly a convention; I generally like to keep things normalized between 0 and 1. Training should work just as well without the normalization, but it might need some tuning of the other hyperparameters, like the stepsize.