SLM-Lab
Real recurrent policy support
Are you requesting a feature or an implementation?
To handle partially observable MDP (POMDP) tasks, recurrent policies are currently quite popular: we add an LSTM layer after the original conv (or MLP) body and store the hidden states for training. But in SLM-Lab, the `RecurrentNet` class has limited abilities. It is more like a concatenation of a series of input states, and the hidden states of the RNN are not stored, which seriously weakens the recurrent policy. For example, I used it with the default parameters to solve the CartPole task, and it failed:
python run_lab.py slm_lab/spec/experimental/ppo/ppo_cartpole.json ppo_rnn_separate_cartpole train
Even when I changed the env's `max_frame` parameter from 500 to 50000, the `RecurrentNet` still couldn't solve it:
[2019-07-14 21:11:38,098 PID:18904 INFO logger.py info] Session 1 done
[2019-07-14 21:11:38,287 PID:18903 INFO __init__.py log_metrics] Trial 0 session 0 ppo_rnn_separate_cartpole_t0_s0 [train_df metrics] final_return_ma: 58.26 strength: 35.4753 max_strength: 178.14 final_strength: 37.14 sample_efficiency: 9.07107e-05 training_efficiency: 6.71198e-06 stability: 0.846315
[2019-07-14 21:11:38,468 PID:18905 INFO __init__.py log_summary] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [train_df] epi: 647 t: 126 wall_t: 655 opt_step: 997120 frame: 49859 fps: 76.1206 total_reward: 126 total_reward_ma: 88.02 loss: 0.610099 lr: 1.44304e-37 explore_var: nan entropy_coef: 0.001 entropy: 0.0258675 grad_norm: nan
[2019-07-14 21:11:38,835 PID:18905 INFO __init__.py log_summary] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [train_df] epi: 648 t: 54 wall_t: 656 opt_step: 997760 frame: 49913 fps: 76.0869 total_reward: 54 total_reward_ma: 88.02 loss: 0.554544 lr: 1.44304e-37 explore_var: nan entropy_coef: 0.001 entropy: 0.217777 grad_norm: nan
[2019-07-14 21:11:38,835 PID:18906 INFO __init__.py log_metrics] Trial 0 session 3 ppo_rnn_separate_cartpole_t0_s3 [eval_df metrics] final_return_ma: 79.4461 strength: 57.5861 max_strength: 159.64 final_strength: 54.39 sample_efficiency: 9.59096e-05 training_efficiency: 4.81586e-06 stability: 0.899133
[2019-07-14 21:11:38,836 PID:18906 INFO logger.py info] Session 3 done
[2019-07-14 21:11:39,296 PID:18903 INFO __init__.py log_metrics] Trial 0 session 0 ppo_rnn_separate_cartpole_t0_s0 [eval_df metrics] final_return_ma: 61.299 strength: 39.439 max_strength: 178.14 final_strength: 32.64 sample_efficiency: 0.000120629 training_efficiency: 6.06361e-06 stability: 0.84144
[2019-07-14 21:11:39,794 PID:18905 INFO logger.py info] Running eval ckpt
[2019-07-14 21:11:39,939 PID:18905 INFO __init__.py log_summary] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [eval_df] epi: 649 t: 0 wall_t: 657 opt_step: 999680 frame: 50000 fps: 76.1035 total_reward: 84.25 total_reward_ma: 78.0294 loss: 2.42707 lr: 1.44304e-37 explore_var: nan entropy_coef: 0.001 entropy: 0.135592 grad_norm: nan
[2019-07-14 21:11:40,234 PID:18903 INFO __init__.py log_metrics] Trial 0 session 0 ppo_rnn_separate_cartpole_t0_s0 [eval_df metrics] final_return_ma: 61.299 strength: 39.439 max_strength: 178.14 final_strength: 32.64 sample_efficiency: 0.000120629 training_efficiency: 6.06361e-06 stability: 0.84144
[2019-07-14 21:11:40,236 PID:18903 INFO logger.py info] Session 0 done
[2019-07-14 21:11:41,480 PID:18905 INFO __init__.py log_metrics] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [train_df metrics] final_return_ma: 88.02 strength: 55.0476 max_strength: 178.14 final_strength: 32.14 sample_efficiency: 8.00063e-05 training_efficiency: 4.46721e-06 stability: 0.708828
[2019-07-14 21:11:42,347 PID:18905 INFO __init__.py log_metrics] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [eval_df metrics] final_return_ma: 78.0294 strength: 56.1694 max_strength: 84.39 final_strength: 62.39 sample_efficiency: 8.97979e-05 training_efficiency: 4.50698e-06 stability: 0.860915
[2019-07-14 21:11:43,242 PID:18905 INFO __init__.py log_metrics] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [eval_df metrics] final_return_ma: 78.0294 strength: 56.1694 max_strength: 84.39 final_strength: 62.39 sample_efficiency: 8.97979e-05 training_efficiency: 4.50698e-06 stability: 0.860915
[2019-07-14 21:11:43,243 PID:18905 INFO logger.py info] Session 2 done
[2019-07-14 21:11:49,818 PID:18839 INFO analysis.py analyze_trial] All trial data zipped to data/ppo_rnn_separate_cartpole_2019_07_14_210040.zip
[2019-07-14 21:11:49,818 PID:18839 INFO logger.py info] Trial 0 done
If you have any suggested solutions
I'm afraid of introducing more bugs, so I'm sorry I'm not able to add this feature myself. But I can offer two reference implementations: OpenAI baselines and pytorch-a2c-ppo-acktr-gail.
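To illustrate the pattern those two repos use (a minimal, hypothetical PyTorch sketch, not their actual code): the policy carries the LSTM hidden state across environment steps and resets it at episode boundaries.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Sketch: MLP body -> LSTM -> action head, with a carried hidden state."""
    def __init__(self, state_dim, hid_dim, action_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hid_dim), nn.ReLU())
        self.lstm = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.head = nn.Linear(hid_dim, action_dim)

    def forward(self, state, hidden=None):
        # state: (batch, state_dim); hidden: (h, c) from the previous step,
        # or None, which PyTorch treats as a zero initial state
        x = self.body(state).unsqueeze(1)   # (batch, 1, hid_dim)
        x, hidden = self.lstm(x, hidden)    # hidden carries memory forward
        return self.head(x.squeeze(1)), hidden

# Acting loop: persist `hidden` across steps, reset when an episode ends
# logits, hidden = policy(state_tensor, hidden)
# if done:
#     hidden = None
```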
With this feature, I believe SLM-Lab will be the top PyTorch RL library.
Thanks in advance!
Hi @yangysc, thanks for testing the RNN. The shared network from the spec `ppo_rnn_shared_cartpole` works slightly better because there are fewer hyperparameters to tune. It yields these results:
[2019-07-14 22:53:07,321 PID:73583 INFO __init__.py log_summary] Trial 0 session 0 ppo_rnn_shared_cartpole_t0_s0 [train_df] epi: 169 t: 200 wall_t: 360 opt_step: 234560 frame: 23465 fps: 65.1806 total_reward: 200 total_reward_ma: 173.03 loss: 0.0292752 lr: 4.55652e-17 explore_var: nan entropy_coef: 0.001 entropy: 0.112986 grad_norm: nan
[2019-07-14 22:53:10,775 PID:73583 INFO __init__.py log_summary] Trial 0 session 0 ppo_rnn_shared_cartpole_t0_s0 [train_df] epi: 170 t: 185 wall_t: 363 opt_step: 236480 frame: 23650 fps: 65.1515 total_reward: 185 total_reward_ma: 173.2 loss: 0.679745 lr: 4.55652e-17 explore_var: nan entropy_coef: 0.001 entropy: 0.228988 grad_norm: nan
[2019-07-14 22:53:14,093 PID:73583 INFO __init__.py log_summary] Trial 0 session 0 ppo_rnn_shared_cartpole_t0_s0 [train_df] epi: 171 t: 200 wall_t: 367 opt_step: 238400 frame: 23850 fps: 64.9864 total_reward: 200 total_reward_ma: 173.35 loss: 0.624804 lr: 4.55652e-17 explore_var: nan entropy_coef: 0.001 entropy: 0.315934 grad_norm: nan
We have not thoroughly tested RNNs yet, but your observation is correct: the `RecurrentNet` class is limited in that sense. The hidden state is discarded rather than carried into the next forward pass. We can implement this by storing the hidden state alongside the state in the agent `Memory`, and retrieving it during `memory.sample()`.
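A rough sketch of that idea (hypothetical, not SLM-Lab's actual `Memory` API): record the hidden state used at acting time alongside each transition, so that `memory.sample()` can hand it back for training.

```python
class RecurrentMemory:
    """Hypothetical sketch: store the RNN hidden state with each transition."""
    def __init__(self):
        self.states, self.actions, self.rewards, self.hiddens = [], [], [], []

    def update(self, state, action, reward, hidden):
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)
        # detach so stored hidden states don't keep old computation graphs alive
        self.hiddens.append(tuple(h.detach() for h in hidden))

    def sample(self):
        # training can then re-run the forward pass from the same recurrent
        # context that was used when each action was taken
        return {
            'states': self.states,
            'actions': self.actions,
            'rewards': self.rewards,
            'hiddens': self.hiddens,
        }
```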
This will take some time to implement, and we're currently busy with benchmarking tasks. I'm marking this issue as a feature request so we can get to it as soon as we have time.
In the meantime, you could try increasing the sequence length (`seq_len`) in the `net` component of the spec file. This unrolls the RNN over a longer window of past states, so the hidden state persists for more steps within each forward pass.
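For example, in the spec file's `net` section (the value is illustrative, and the other `net` fields are omitted here):

```json
"net": {
    "type": "RecurrentNet",
    "seq_len": 8
}
```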