rl_reach
DDPG + HER
I was wondering whether you have tried to train a model with DDPG + HER. I had some success training SAC + HER, but with DDPG my arm "folds" itself by eventually driving the joint positions q to their limits.
If you have, maybe you could share some thoughts on it. Thanks in advance.
*Addendum: Even with plain DDPG (no HER), the robot arm moves into said configuration, and it is very difficult for it to move out of it from there. I also implemented DDPG from scratch, validated it on the "Pendulum-v0" Gym environment, and then tried it on my robot environment, but the result was similar. After around 10 optimization steps, my cumulative reward slowly starts drifting towards the negative. Any insight would be much appreciated.
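(For reference, a minimal sanity check of a from-scratch implementation against the stable-baselines3 DDPG on Pendulum-v0 could look like the sketch below; the hyperparameters and timestep budget are illustrative assumptions, not values from this thread.)

```python
import gym
import numpy as np
from stable_baselines3 import DDPG
from stable_baselines3.common.noise import NormalActionNoise

# Reference run: if this learns but the from-scratch agent does not,
# the problem is likely in the custom DDPG rather than the environment.
env = gym.make("Pendulum-v0")
n_actions = env.action_space.shape[0]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

model = DDPG("MlpPolicy", env, action_noise=action_noise, verbose=1)
model.learn(total_timesteps=20_000)
```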
Hi Stefan,
I did a quick check and I didn't encounter any problem in training a model with DDPG. I'm not sure what you mean by "fold", can you attach a screenshot to illustrate?
Can you also give more details on your training environment (i.e. observation shape, reward function, action shape, fixed / random goal, fixed / moving goal, action space) and DDPG hyperparameters?
By folding I mean that the robot goes into a configuration where all joint angles are at their maximum or minimum and it can't get out of that configuration.
Here is a picture:
[screenshot: the arm folded into its joint limits]
I think it is best to give you a link to my repository. https://github.com/stefanwanckel/DRL/tree/main/Tryhard
I adapted my GymEnv structure from your repository, so it is pretty similar. I stopped tracking your repository though, so I will stick with an older version. I am using the train.py script provided by stable-baselines3-zoo to train my models.
The init for the environment in the picture looks like this:
```python
id='ur5e_reacher-v5',
entry_point='ur5e_env.envs.ur5e_env:Ur5eEnv',
max_episode_steps=2000,
kwargs={
    'random_position': False,
    'random_orientation': False,
    'moving_target': False,
    'target_type': "sphere",
    'goal_oriented': True,
    'obs_type': 1,
    'reward_type': 13,
    'action_type': 1,
    'joint_limits': "small",
    'action_min': [-1, -1, -1, -1, -1, -1],
    'action_max': [1, 1, 1, 1, 1, 1],
    'alpha_reward': 0.1,
    'reward_coeff': 1,
    'action_scale': 1,
    'eps': 0.1,
    'sim_rep': 5,
    'action_mode': "force"
}
```
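(A quick way to double-check the registered environment and the spaces asked about above is a sketch like the following; the package name ur5e_env is assumed from the entry_point and may differ.)

```python
import gym
import ur5e_env  # assumed: importing the package registers 'ur5e_reacher-v5'

env = gym.make('ur5e_reacher-v5')
print(env.observation_space)  # should be a Dict space since goal_oriented=True
print(env.action_space)       # should be Box(-1, 1, (6,)) given action_min/action_max

obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
print(reward, done, info)
```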
The hyperparams look like this:
```yaml
ur5e_reacher-v5:
  n_timesteps: !!float 1e7
  policy: 'MlpPolicy'
  model_class: 'ddpg'
  n_sampled_goal: 4
  goal_selection_strategy: 'future'
  buffer_size: 1000000
  batch_size: 128
  gamma: 0.95
  learning_rate: !!float 1e-3
  noise_type: 'normal'
  noise_std: 0.2
  policy_kwargs: "dict(net_arch=[512, 512, 512])"
  online_sampling: True
  # max_episode_length: 100
```
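(For readers following along, these zoo hyperparameters map roughly onto the direct stable-baselines3 setup sketched below; this assumes SB3 1.x, where HER is configured through HerReplayBuffer, and only the numerical values are taken from the YAML above.)

```python
import gym
import numpy as np
from stable_baselines3 import DDPG, HerReplayBuffer
from stable_baselines3.common.noise import NormalActionNoise

env = gym.make("ur5e_reacher-v5")  # goal-conditioned env with a Dict observation space
n_actions = env.action_space.shape[0]

model = DDPG(
    "MultiInputPolicy",                    # Dict observations require the multi-input policy
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,
        goal_selection_strategy="future",
        online_sampling=True,              # SB3 1.x option; removed in later releases
    ),
    buffer_size=1_000_000,
    batch_size=128,
    gamma=0.95,
    learning_rate=1e-3,
    action_noise=NormalActionNoise(mean=np.zeros(n_actions), sigma=0.2 * np.ones(n_actions)),
    policy_kwargs=dict(net_arch=[512, 512, 512]),
    verbose=1,
)
model.learn(total_timesteps=int(1e7))
```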
I don't have time for case-by-case troubleshooting, but I can suggest a few things:
- check that you can train with DDPG on the simplest env in rl_reach (fixed target) and adapt your custom environment from a working case
- use a dense reward function, e.g. the negative distance between the end effector and the target (see the first sketch after this list)
- plot some useful metrics during evaluation, such as the reward, action, joint position, pybullet action (see https://github.com/PierreExeter/rl_reach/blob/master/code/scripts/plot_episode_eval_log.py)
- plot the reward vs timesteps and check that it is increasing (see the second sketch after this list)
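As an illustration of the dense-reward suggestion above, a common choice is the negative Euclidean distance between the end effector and the goal, optionally with a small action penalty; the function and argument names below are hypothetical, not taken from rl_reach.

```python
import numpy as np

def dense_reward(end_effector_pos, goal_pos, action=None, alpha=0.1):
    """Negative distance to the goal, optionally penalising large actions."""
    dist = np.linalg.norm(np.asarray(end_effector_pos) - np.asarray(goal_pos))
    reward = -dist
    if action is not None:
        # Small penalty that discourages saturating the joints at their limits.
        reward -= alpha * float(np.square(action).sum())
    return reward
```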
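For the reward-vs-timesteps check, the monitor files written during training can be plotted with stable-baselines3's helpers, for example as below (the log directory is an assumed placeholder):

```python
import matplotlib.pyplot as plt
from stable_baselines3.common.results_plotter import load_results, ts2xy

# Folder containing the monitor.csv file(s) written during training (assumed path).
log_dir = "logs/ddpg/ur5e_reacher-v5_1"

timesteps, episode_rewards = ts2xy(load_results(log_dir), "timesteps")
plt.plot(timesteps, episode_rewards)
plt.xlabel("Timesteps")
plt.ylabel("Episode reward")
plt.title("Training reward (should trend upwards)")
plt.show()
```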