ACER fails to learn MuJoCo
When I train on Gym MuJoCo environments instead of CartPole by running train_acer_gym.py, it fails to learn and the reward decreases as time steps go on (for HalfCheetah-v1 and Reacher-v1).
Also, the initial policy of ACER has a fluctuating reward at each start (for example, the first episode reward of Reacher-v1 fluctuates from -500 to -100), so is there some way to get a stable initial policy? (The first episode reward of PPO is around -100.)
I guess it is because of the different weight initialization. In examples/gym/train_ppo_gym.py, the layer that outputs the mean of the action distribution is scaled by 1e-2 (mean_wscale=1e-2). You may be able to stabilize the initial policy for ACER, too, by specifying mean_wscale=1e-2.
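Here is a minimal sketch of what that change could look like, assuming the ACER example builds its Gaussian policy with chainerrl.policies.FCGaussianPolicy (which also accepts a mean_wscale argument); hidden_size and the other hyperparameters here are placeholders, not necessarily the example's actual values:

```python
import gym
from chainerrl import policies

env = gym.make('Reacher-v1')
obs_size = env.observation_space.low.size
action_space = env.action_space
hidden_size = 200  # placeholder; use the value from train_acer_gym.py

pi = policies.FCGaussianPolicy(
    obs_size,
    action_space.low.size,
    n_hidden_channels=hidden_size,
    n_hidden_layers=2,
    bound_mean=True,
    min_action=action_space.low,
    max_action=action_space.high,
    mean_wscale=1e-2,  # scale down the initial weights of the mean output layer
)
```

With mean_wscale=1e-2 the mean head starts out near zero, so the initial actions (and hence the first-episode rewards) should depend much less on the random seed.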
If it improves the performance, please let us know so that we can improve the default settings.
Another possible cause is the scale of the observation space: examples/gym/train_ppo_gym.py normalizes observations so that mean=0 and std=1, while examples/gym/train_acer_gym.py doesn't.
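Since the ACER example doesn't normalize, one way to try this is to normalize at the environment level. Below is a rough, illustrative sketch (not part of chainerrl) of a Gym wrapper that whitens observations with running statistics, similar in spirit to the normalization the PPO example applies:

```python
import gym
import numpy as np


class NormalizeObservation(gym.ObservationWrapper):
    """Whitens observations using running mean/std estimates (Welford's method)."""

    def __init__(self, env, eps=1e-8):
        super(NormalizeObservation, self).__init__(env)
        size = env.observation_space.low.size
        self.count = 0
        self.mean = np.zeros(size, dtype=np.float64)
        self.m2 = np.zeros(size, dtype=np.float64)  # running sum of squared deviations
        self.eps = eps

    def observation(self, obs):
        obs = np.asarray(obs, dtype=np.float64).ravel()
        # Online update of the running mean and variance
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (obs - self.mean)
        std = np.sqrt(self.m2 / max(self.count - 1, 1)) + self.eps
        return ((obs - self.mean) / std).astype(np.float32)

    # some older gym versions call _observation instead of observation
    _observation = observation


env = NormalizeObservation(gym.make('HalfCheetah-v1'))
```

Note that the statistics keep moving during training, so the same raw observation can map to slightly different normalized values over time; the running normalization in the PPO example has the same property.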