D4PG
How should Vmin and Vmax be set for the other MuJoCo tasks?
Are they based on the cumulative reward?
Yes, it is based on the expected cumulative reward from the environment. For Pendulum the reward is never positive, so it's easy to set Vmax to 0. Finding Vmin is really a matter of trial and error, since it is simply another hyperparameter. The strategy I used to find this lower bound empirically is to run a random agent in the environment and log the reward at each state. You can then estimate the minimum value by calculating the cumulative future reward from each state and taking the smallest. This gives a good lower bound on the value range, because you are unlikely ever to have an agent that performs worse than a random one; as soon as training starts, the values should increase. You could use the same strategy to find bounds for other environments such as MuJoCo.
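A minimal sketch of the random-agent strategy, assuming the per-step rewards of each random-agent episode have already been logged as lists of floats. It computes the return from every state by walking each episode backwards, then takes the smallest as the Vmin estimate. The discount factor `gamma` is an assumption: the bounds should match the discount used to train the critic (use `gamma=1.0` for a plain undiscounted sum). All function names are hypothetical, and `random_agent_rewards` draws made-up rewards instead of stepping a real environment.

```python
import random
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one computed after it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def estimate_vmin(episodes: Sequence[Sequence[float]], gamma: float) -> float:
    """Smallest return observed from any logged state (candidate Vmin)."""
    return min(g for ep in episodes for g in discounted_returns(ep, gamma))


def random_agent_rewards(num_steps: int, rng: random.Random) -> List[float]:
    """Hypothetical stand-in for a random-policy rollout in a real environment.

    Rewards are drawn uniformly from a made-up range [-16.0, 0.0] for illustration.
    """
    return [rng.uniform(-16.0, 0.0) for _ in range(num_steps)]


if __name__ == "__main__":
    rng = random.Random(0)
    episodes = [random_agent_rewards(200, rng) for _ in range(10)]
    print(f"Estimated Vmin: {estimate_vmin(episodes, gamma=0.99):.1f}")
```

With a real environment you would replace `random_agent_rewards` with the rewards logged from your own random-agent rollouts; for an environment like Pendulum, where rewards are never positive, Vmax can stay at 0.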
Did you find the most appropriate parameters for the MuJoCo tasks?