ACER fails to learn MuJoCo
When I train on Gym MuJoCo environments instead of CartPole by running train_acer_gym.py, it fails to learn and the reward decreases as time steps go on (for HalfCheetah-v1 and Reacher-v1).
Also, the initial policy of ACER has a fluctuating reward at each start (for example, the first episode reward of Reacher-v1 fluctuates from -500 to -100), so is there some way to get a stable initial policy? (The first episode reward of PPO is around -100.)
I guess it is because of the different weight initialization. In examples/gym/train_ppo_gym.py, the layer that outputs the mean of the action distribution is scaled by 1e-2 (mean_wscale=1e-2). You may be able to stabilize the initial policy for ACER, too, by specifying mean_wscale=1e-2.
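Here is a minimal sketch of what that change could look like, assuming the ACER example builds its Gaussian policy with chainerrl.policies.FCGaussianPolicy (which also accepts a mean_wscale argument); hidden_size and the other hyperparameters here are placeholders, not necessarily the example's actual values:

```python
import gym
from chainerrl import policies

env = gym.make('Reacher-v1')
obs_size = env.observation_space.low.size
action_space = env.action_space
hidden_size = 200  # placeholder; use the value from train_acer_gym.py

pi = policies.FCGaussianPolicy(
    obs_size,
    action_space.low.size,
    n_hidden_channels=hidden_size,
    n_hidden_layers=2,
    bound_mean=True,
    min_action=action_space.low,
    max_action=action_space.high,
    mean_wscale=1e-2,  # scale down the initial weights of the mean output layer
)
```

With mean_wscale=1e-2 the mean head starts out near zero, so the initial actions (and hence the first-episode rewards) should depend much less on the random seed.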
If it improves the performance, please let us know so that we can improve the default settings.
Another possible cause is the scale of the observation space: examples/gym/train_ppo_gym.py normalizes observations so that mean=0 and std=1, while examples/gym/train_acer_gym.py doesn't.
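Since the ACER example doesn't normalize, one way to try this is to normalize at the environment level. Below is a rough, illustrative sketch (not part of chainerrl) of a Gym wrapper that whitens observations with running statistics, similar in spirit to the normalization the PPO example applies:

```python
import gym
import numpy as np


class NormalizeObservation(gym.ObservationWrapper):
    """Whitens observations using running mean/std estimates (Welford's method)."""

    def __init__(self, env, eps=1e-8):
        super(NormalizeObservation, self).__init__(env)
        size = env.observation_space.low.size
        self.count = 0
        self.mean = np.zeros(size, dtype=np.float64)
        self.m2 = np.zeros(size, dtype=np.float64)  # running sum of squared deviations
        self.eps = eps

    def observation(self, obs):
        obs = np.asarray(obs, dtype=np.float64).ravel()
        # Online update of the running mean and variance
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (obs - self.mean)
        std = np.sqrt(self.m2 / max(self.count - 1, 1)) + self.eps
        return ((obs - self.mean) / std).astype(np.float32)

    # some older gym versions call _observation instead of observation
    _observation = observation


env = NormalizeObservation(gym.make('HalfCheetah-v1'))
```

Note that the statistics keep moving during training, so the same raw observation can map to slightly different normalized values over time; the running normalization in the PPO example has the same property.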