SLM-Lab
Real recurrent policy support
Are you requesting a feature or an implementation?
To handle partially observable MDP (POMDP) tasks, recurrent policies are currently quite popular: we add an LSTM layer after the original conv (or MLP) body and store the hidden states for training. But in SLM-Lab, the `RecurrentNet` class has limited abilities. It is more like a concatenation of a series of input states, and the hidden states of the RNN are not stored, which seriously weakens the recurrent policy. For example, I used it with the default parameters to solve the CartPole task, and it failed:
python run_lab.py slm_lab/spec/experimental/ppo/ppo_cartpole.json ppo_rnn_separate_cartpole train
Even when I changed the env's `max_frame` parameter from 500 to 50000, the `RecurrentNet` still couldn't solve it:
[2019-07-14 21:11:38,098 PID:18904 INFO logger.py info] Session 1 done
[2019-07-14 21:11:38,287 PID:18903 INFO __init__.py log_metrics] Trial 0 session 0 ppo_rnn_separate_cartpole_t0_s0 [train_df metrics] final_return_ma: 58.26 strength: 35.4753 max_strength: 178.14 final_strength: 37.14 sample_efficiency: 9.07107e-05 training_efficiency: 6.71198e-06 stability: 0.846315
[2019-07-14 21:11:38,468 PID:18905 INFO __init__.py log_summary] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [train_df] epi: 647 t: 126 wall_t: 655 opt_step: 997120 frame: 49859 fps: 76.1206 total_reward: 126 total_reward_ma: 88.02 loss: 0.610099 lr: 1.44304e-37 explore_var: nan entropy_coef: 0.001 entropy: 0.0258675 grad_norm: nan
[2019-07-14 21:11:38,835 PID:18905 INFO __init__.py log_summary] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [train_df] epi: 648 t: 54 wall_t: 656 opt_step: 997760 frame: 49913 fps: 76.0869 total_reward: 54 total_reward_ma: 88.02 loss: 0.554544 lr: 1.44304e-37 explore_var: nan entropy_coef: 0.001 entropy: 0.217777 grad_norm: nan
[2019-07-14 21:11:38,835 PID:18906 INFO __init__.py log_metrics] Trial 0 session 3 ppo_rnn_separate_cartpole_t0_s3 [eval_df metrics] final_return_ma: 79.4461 strength: 57.5861 max_strength: 159.64 final_strength: 54.39 sample_efficiency: 9.59096e-05 training_efficiency: 4.81586e-06 stability: 0.899133
[2019-07-14 21:11:38,836 PID:18906 INFO logger.py info] Session 3 done
[2019-07-14 21:11:39,296 PID:18903 INFO __init__.py log_metrics] Trial 0 session 0 ppo_rnn_separate_cartpole_t0_s0 [eval_df metrics] final_return_ma: 61.299 strength: 39.439 max_strength: 178.14 final_strength: 32.64 sample_efficiency: 0.000120629 training_efficiency: 6.06361e-06 stability: 0.84144
[2019-07-14 21:11:39,794 PID:18905 INFO logger.py info] Running eval ckpt
[2019-07-14 21:11:39,939 PID:18905 INFO __init__.py log_summary] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [eval_df] epi: 649 t: 0 wall_t: 657 opt_step: 999680 frame: 50000 fps: 76.1035 total_reward: 84.25 total_reward_ma: 78.0294 loss: 2.42707 lr: 1.44304e-37 explore_var: nan entropy_coef: 0.001 entropy: 0.135592 grad_norm: nan
[2019-07-14 21:11:40,234 PID:18903 INFO __init__.py log_metrics] Trial 0 session 0 ppo_rnn_separate_cartpole_t0_s0 [eval_df metrics] final_return_ma: 61.299 strength: 39.439 max_strength: 178.14 final_strength: 32.64 sample_efficiency: 0.000120629 training_efficiency: 6.06361e-06 stability: 0.84144
[2019-07-14 21:11:40,236 PID:18903 INFO logger.py info] Session 0 done
[2019-07-14 21:11:41,480 PID:18905 INFO __init__.py log_metrics] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [train_df metrics] final_return_ma: 88.02 strength: 55.0476 max_strength: 178.14 final_strength: 32.14 sample_efficiency: 8.00063e-05 training_efficiency: 4.46721e-06 stability: 0.708828
[2019-07-14 21:11:42,347 PID:18905 INFO __init__.py log_metrics] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [eval_df metrics] final_return_ma: 78.0294 strength: 56.1694 max_strength: 84.39 final_strength: 62.39 sample_efficiency: 8.97979e-05 training_efficiency: 4.50698e-06 stability: 0.860915
[2019-07-14 21:11:43,242 PID:18905 INFO __init__.py log_metrics] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [eval_df metrics] final_return_ma: 78.0294 strength: 56.1694 max_strength: 84.39 final_strength: 62.39 sample_efficiency: 8.97979e-05 training_efficiency: 4.50698e-06 stability: 0.860915
[2019-07-14 21:11:43,243 PID:18905 INFO logger.py info] Session 2 done
[2019-07-14 21:11:49,818 PID:18839 INFO analysis.py analyze_trial] All trial data zipped to data/ppo_rnn_separate_cartpole_2019_07_14_210040.zip
[2019-07-14 21:11:49,818 PID:18839 INFO logger.py info] Trial 0 done
If you have any suggested solutions
I'm afraid of introducing more bugs, so I'm sorry I'm not able to add this feature myself. But I can offer two reference implementations: OpenAI baselines and pytorch-a2c-ppo-acktr-gail.
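To illustrate the pattern those two repos use (a minimal, hypothetical PyTorch sketch, not their actual code): the policy carries the LSTM hidden state across environment steps and resets it at episode boundaries.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Sketch: MLP body -> LSTM -> action head, with a carried hidden state."""
    def __init__(self, state_dim, hid_dim, action_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hid_dim), nn.ReLU())
        self.lstm = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.head = nn.Linear(hid_dim, action_dim)

    def forward(self, state, hidden=None):
        # state: (batch, state_dim); hidden: (h, c) from the previous step,
        # or None, which PyTorch treats as a zero initial state
        x = self.body(state).unsqueeze(1)   # (batch, 1, hid_dim)
        x, hidden = self.lstm(x, hidden)    # hidden carries memory forward
        return self.head(x.squeeze(1)), hidden

# Acting loop: persist `hidden` across steps, reset when an episode ends
# logits, hidden = policy(state_tensor, hidden)
# if done:
#     hidden = None
```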
With this feature, I believe SLM-Lab will be the top PyTorch RL library.
Thanks in advance!
Hi @yangysc, thanks for testing the RNN. The shared network from the spec `ppo_rnn_shared_cartpole` works slightly better because there are fewer hyperparameters to tune. It yields these results:
[2019-07-14 22:53:07,321 PID:73583 INFO __init__.py log_summary] Trial 0 session 0 ppo_rnn_shared_cartpole_t0_s0 [train_df] epi: 169 t: 200 wall_t: 360 opt_step: 234560 frame: 23465 fps: 65.1806 total_reward: 200 total_reward_ma: 173.03 loss: 0.0292752 lr: 4.55652e-17 explore_var: nan entropy_coef: 0.001 entropy: 0.112986 grad_norm: nan
[2019-07-14 22:53:10,775 PID:73583 INFO __init__.py log_summary] Trial 0 session 0 ppo_rnn_shared_cartpole_t0_s0 [train_df] epi: 170 t: 185 wall_t: 363 opt_step: 236480 frame: 23650 fps: 65.1515 total_reward: 185 total_reward_ma: 173.2 loss: 0.679745 lr: 4.55652e-17 explore_var: nan entropy_coef: 0.001 entropy: 0.228988 grad_norm: nan
[2019-07-14 22:53:14,093 PID:73583 INFO __init__.py log_summary] Trial 0 session 0 ppo_rnn_shared_cartpole_t0_s0 [train_df] epi: 171 t: 200 wall_t: 367 opt_step: 238400 frame: 23850 fps: 64.9864 total_reward: 200 total_reward_ma: 173.35 loss: 0.624804 lr: 4.55652e-17 explore_var: nan entropy_coef: 0.001 entropy: 0.315934 grad_norm: nan
We have not thoroughly tested RNNs yet, but your observation is correct: the `RecurrentNet` class is limited in that sense. The hidden state is discarded rather than carried into the next forward pass. We can implement this by storing the hidden state alongside the state in the agent `Memory`, and retrieving it during `memory.sample()`.
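A rough sketch of that idea (hypothetical, not SLM-Lab's actual `Memory` API): record the hidden state used at acting time alongside each transition, so that `memory.sample()` can hand it back for training.

```python
class RecurrentMemory:
    """Hypothetical sketch: store the RNN hidden state with each transition."""
    def __init__(self):
        self.states, self.actions, self.rewards, self.hiddens = [], [], [], []

    def update(self, state, action, reward, hidden):
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)
        # detach so stored hidden states don't keep old computation graphs alive
        self.hiddens.append(tuple(h.detach() for h in hidden))

    def sample(self):
        # training can then re-run the forward pass from the same recurrent
        # context that was used when each action was taken
        return {
            'states': self.states,
            'actions': self.actions,
            'rewards': self.rewards,
            'hiddens': self.hiddens,
        }
```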
This will take some time to implement, and we're currently busy with benchmarking tasks. I'm marking this issue as a feature request so we can get to it as soon as we have time.
In the meantime, you could try increasing the sequence length (`seq_len`) in the `net` component of the spec file. This unrolls the RNN over a longer window of past states, so the hidden state persists for more steps within each forward pass.
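For example, in the spec file's `net` section (the value is illustrative, and the other `net` fields are omitted here):

```json
"net": {
    "type": "RecurrentNet",
    "seq_len": 8
}
```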