
Setting hyperparameters to reproduce the published results, and MPPI

Open jd730 opened this issue 5 years ago • 22 comments

Hi, I have been trying to reproduce the published results using this library, but I ran into some issues while trying to follow Appendix Section E.

Could you tell me how to run MPPI mode? I only found CEM and RS (random shooting) options.

Also, if I want to set K and M to different values, where should I change the code? The only relevant parameter I found is adapt_batch_size, which seems to be used only when K equals M for GrBAL.

Third, do TS/itr and Tasks/itr correspond to num_rollouts * max_path_length and num_rollouts, respectively?

Finally, how are K and M used for ReBAL?

Thank you.
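For reference, by MPPI I mean the standard model-predictive path integral update (Williams et al.). Below is a minimal numpy sketch of that planner, not this repository's implementation; dynamics_fn and reward_fn are placeholders for a learned model and a task reward.

```python
import numpy as np

def mppi_plan(state, dynamics_fn, reward_fn, horizon=20, n_samples=500,
              act_dim=6, noise_std=0.5, temperature=1.0, mean=None):
    """One MPPI planning step; a generic sketch, not this repository's code."""
    if mean is None:
        mean = np.zeros((horizon, act_dim))
    # Sample perturbed action sequences around the current mean plan.
    noise = np.random.randn(n_samples, horizon, act_dim) * noise_std
    actions = mean[None] + noise

    # Roll out every sequence through the learned model and accumulate returns.
    states = np.repeat(state[None], n_samples, axis=0)
    returns = np.zeros(n_samples)
    for t in range(horizon):
        returns += reward_fn(states, actions[:, t])
        states = dynamics_fn(states, actions[:, t])

    # Exponentially weight the sequences by return and update the mean plan.
    weights = np.exp((returns - returns.max()) / temperature)
    weights /= weights.sum()
    mean = (weights[:, None, None] * actions).sum(axis=0)
    return mean[0], mean  # action to execute now, and the plan to warm-start with
```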

jd730 avatar Jul 25 '19 05:07 jd730

@jd730 Hi, I'm also interested in this work and am also trying to reproduce the published results.

I am confused about the same points you raised, and I hope to see a reply.

Thank you.

morantumi avatar Aug 11 '19 04:08 morantumi

@jd730 Hi, I am confused about the parameters in the appendix. Do the half-cheetah (HC) parameters given in the code on GitHub actually correspond to the parameters in Appendix E? I am puzzled by this and would be very happy to discuss it with you.

In addition, I ran the code on a 1080 Ti GPU by running the script python run_scripts/run_grbal.py directly with the default parameter settings. I ran it three times; each run took about 10 hours and the results were not good. My email is whn389085539@gmail.com, and I am looking forward to communicating with you.

Thank you.

morantumi avatar Aug 11 '19 15:08 morantumi

@morantumi I am confused by the parameters too. I think the default parameters in run_grbal.py are quite different from those in Appendix E, and some of the parameters I mentioned above are still unclear.

jd730 avatar Aug 11 '19 23:08 jd730

@jd730 Today I will try to adjust the parameters in the code to match those in the appendix. If I make any progress I will share it by email, and I hope to discuss the parameter issue with you.

Thank you.

morantumi avatar Aug 12 '19 00:08 morantumi

Hi,

I have also tried run_grbal.py (default version) and it seems that it cannot reproduce the original result. Here is the output from the last iteration:

-----------------------------------------
| AverageDiscountedReturn | 10.5        |
| AverageForwardProgress  | 1.07        |
| AverageReturn           | 3.22        |
| AvgModelEpochTime       | 18.6        |
| EnvExecTime             | 1.13        |
| Epochs                  | 99          |
| Itr                     | 49          |
| ItrTime                 | 2.02e+03    |
| MaxForwardProgress      | 1.36        |
| MaxReturn               | 33.9        |
| MinForwardProgress      | 0.665       |
| MinReturn               | -38.3       |
| NumTrajs                | 5           |
| PolicyExecTime          | 157         |
| Post-Loss               | 0.01711247  |
| Pre-Loss                | 0.020023113 |
| StdForwardProgress      | 0.291       |
| StdReturn               | 31          |
| Time                    | 5.52e+04    |
| Time-EnvSampleProc      | 0.000819    |
| Time-EnvSampling        | 159         |
| Time-ModelFit           | 1.86e+03    |
| n_timesteps             | 250000      |
-----------------------------------------
Training finished

I was thinking the reason might be that 250000 steps is still too few for a decent result? I haven't tried other hyperparameters yet. Thank you for sharing these results~

dingchenghu avatar Aug 13 '19 00:08 dingchenghu

@dingchenghu Hi, at first I also thought that was the reason, but now I think it may be because the inner learning rate is set too small. In Appendix E the inner LR is 0.01, but the default in the code is 0.001, so I think some parameters in the code may be incorrect. There are still some parameters in the code that I can't match with those in Appendix E, which is also my point of confusion.

morantumi avatar Aug 13 '19 01:08 morantumi

@jd730 @dingchenghu I think the parameters K and M come from splitting the data with valid_split_ratio; the split is done by train_test_split() in meta_mlp_dynamics.py, line 192. For the half-cheetah hyperparameters, I think some parameters in the code may need to be adjusted as follows:

'inner_learning_rate': 0.001 -> 0.01, which may correspond to Inner LR in Appendix E
'dynamic_model_epochs': 100 -> 50, which may correspond to Epochs in Appendix E
'valid_split_ratio': 0.1 -> 0.5, which may determine K and M in Appendix E
'adapt_batch_size': 16 -> 500, which may correspond to Batch Size in Appendix E (I'm not very sure)
'num_rollouts': 5 -> 32, which may be Tasks/itr?
'max_path_length': 1000 -> 2000

The above is only my idea. For the rest of the parameters I am not sure which variables in the code they correspond to; I think nA Train may be represented by num_steps_per_epoch in meta_mlp_dynamics.py, line 210, and nA Test by num_steps_test on line 212.
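To make this concrete, here is a rough sketch of the config changes I am proposing for run_scripts/run_grbal.py (the values are only my guesses for matching Appendix E, not confirmed by the authors):

```python
# Rough sketch of the proposed config changes for run_scripts/run_grbal.py
# (my guesses for matching Appendix E on half-cheetah; not confirmed).
config = {
    # ... other keys unchanged ...
    'inner_learning_rate': 0.01,   # was 0.001; "Inner LR" in Appendix E
    'dynamic_model_epochs': 50,    # was 100; "Epochs" in Appendix E
    'valid_split_ratio': 0.5,      # was 0.1; possibly determines K and M
    'adapt_batch_size': 500,       # was 16; possibly "Batch Size" (unsure)
    'num_rollouts': 32,            # was 5; possibly "Tasks/itr"
    'max_path_length': 2000,       # was 1000
}
```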

I will try to verify my thoughts and hope to communicate with you more. I am looking forward to your reply.

Thank you.

morantumi avatar Aug 13 '19 08:08 morantumi

@morantumi Thank you for sharing your findings. However, adapt_batch_size is the window size of the recent history used for adaptation (see here). Also, valid_split_ratio seems to be used only for the validation split.
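To illustrate what I mean, here is a rough sketch of how I understand adapt_batch_size being used at run time (not the repository's exact code; model.fast_adapt is a hypothetical stand-in for the inner adaptation step):

```python
import numpy as np

def adapt_on_recent_history(model, history, adapt_batch_size=16):
    """Adapt the meta-learned model on the last M observed transitions.

    A sketch of my understanding, not the repository's code. `history` is a
    list of (obs, act, next_obs) tuples collected during the current rollout;
    `model.fast_adapt` is a hypothetical one-gradient-step adaptation method.
    """
    recent = history[-adapt_batch_size:]            # the M most recent steps
    obs, act, next_obs = map(np.array, zip(*recent))
    return model.fast_adapt(obs, act, next_obs)     # adapted parameters for planning
```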

jd730 avatar Aug 14 '19 01:08 jd730

Hi all, sorry for the late reply. I'm currently on vacation and I'll be back in 10 days. I'll take a look at your concerns, add the MPPI code, and fix any part of the code if necessary once I'm back. I'm sorry for the inconvenience.

Best, Ignasi Clavera

iclavera avatar Aug 14 '19 06:08 iclavera

Thank you!

I also tried run_rebal.py. Judging from AverageReturn, it does not seem to be working either.

After some iterations, the loss became NaN.

--------------------------------------
| AverageDiscountedReturn | -24.7    |
| AverageForwardProgress  | -1.11    |
| AverageReturn           | -211     |
| AvgModelEpochTime       | 42.2     |
| EnvExecTime             | 4.37     |
| Epochs                  | 49       |
| Itr                     | 47       |
| ItrTime                 | 2.48e+03 |
| MaxForwardProgress      | -0.399   |
| MaxReturn               | -141     |
| MinForwardProgress      | -2.09    |
| MinReturn               | -308     |
| NumTrajs                | 5        |
| PolicyExecTime          | 367      |
| StdForwardProgress      | 0.577    |
| StdReturn               | 57.3     |
| Time                    | 6.52e+04 |
| Time-EnvSampleProc      | 0.0147   |
| Time-EnvSampling        | 372      |
| Time-ModelFit           | 2.11e+03 |
| n_timesteps             | 240000   |
--------------------------------------

 ---------------- Iteration 48 ----------------
Obtaining samples from the environment using the policy...
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:05:49
Processing environment samples...
Training dynamics model for 50 epochs ...
Training RNNDynamicsModel - finished epoch 0 --train loss: nan  valid loss: nan  valid_loss_mov_avg: nan  epoch time: 41.84
Training RNNDynamicsModel - finished epoch 1 --train loss: nan  valid loss: nan  valid_loss_mov_avg: nan  epoch time: 43.81
Training RNNDynamicsModel - finished epoch 2 --train loss: nan  valid loss: nan  valid_loss_mov_avg: nan  epoch time: 40.18
Training RNNDynamicsModel - finished epoch 3 --train loss: nan  valid loss: nan  valid_loss_mov_avg: nan  epoch time: 40.29
Training RNNDynamicsModel - finished epoch 4 --train loss: nan  valid loss: nan  valid_loss_mov_avg: nan  epoch time: 40.36
Training RNNDynamicsModel - finished epoch 5 --train loss: nan  valid loss: nan  valid_loss_mov_avg: nan  epoch time: 42.47
Training RNNDynamicsModel - finished epoch 6 --train loss: nan  valid loss: nan  valid_loss_mov_avg: nan  epoch time: 40.05
Training RNNDynamicsModel - finished epoch 7 --train loss: nan  valid loss: nan  valid_loss_mov_avg: nan  epoch time: 39.70
Training RNNDynamicsModel - finished epoch 8 --train loss: nan  valid loss: nan  valid_loss_mov_avg: nan  epoch time: 42.27
Training RNNDynamicsModel - finished epoch 9 --train loss: nan  valid loss: nan  valid_loss_mov_avg: nan  epoch time: 40.53
Training RNNDynamicsModel - finished epoch 10 --train loss: nan  valid loss: nan  valid_loss_mov_avg: nan  epoch time: 40.21
Training RNNDynamicsModel - finished epoch 11 --train loss: nan  valid loss: nan  valid_loss_mov_avg: nan  epoch time: 40.13
Training RNNDynamicsModel - finished epoch 12 --train loss: nan  valid loss: nan  valid_loss_mov_avg: nan  epoch time: 42.42
Training RNNDynamicsModel - finished epoch 13 --train loss: nan  valid loss: nan  valid_loss_mov_avg: nan  epoch time: 40.43
Training RNNDynamicsModel - finished epoch 14 --train loss: nan  valid loss: nan  valid_loss_mov_avg: nan  epoch time: 40.87
Training RNNDynamicsModel - finished epoch 15 --train loss: nan  valid loss: nan  valid_loss_mov_avg: nan  epoch time: 41.55
Training RNNDynamicsModel - finished epoch 16 --train loss: nan  valid loss: nan  valid_loss_mov_avg: nan  epoch time: 40.03
Training RNNDynamicsModel - finished epoch 17 --train loss: nan  valid loss: nan  valid_loss_mov_avg: nan  epoch time: 40.04
Training RNNDynamicsModel - finished epoch 18 --train loss: nan  valid loss: nan  valid_loss_mov_avg: nan  epoch time: 40.33
Training RNNDynamicsModel - finished epoch 19 --train loss: nan  valid loss: nan  valid_loss_mov_avg: nan  epoch time: 41.43
Training RNNDynamicsModel - finished epoch 20 --train loss: nan  valid loss: nan  valid_loss_mov_avg: nan  epoch time: 40.19
Training RNNDynamicsModel - finished epoch 21 --train loss: nan  valid loss: nan  valid_loss_mov_avg: nan  epoch time: 40.47
Training RNNDynamicsModel - finished epoch 22 --train loss: nan  valid loss: nan  valid_loss_mov_avg: nan  epoch time: 42.93
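In case it helps, gradient clipping is a common mitigation when RNN losses go to NaN. A generic TF1 sketch of what I mean (not this repository's code; the variable and loss below are stand-ins, not the actual dynamics model):

```python
import tensorflow as tf

# Generic TF1 sketch of gradient clipping, a common fix when RNN losses go to NaN.
w = tf.Variable([1.0])                       # stand-in for the model parameters
loss = tf.reduce_mean(tf.square(w - 0.5))    # stand-in for the dynamics loss
optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)
grads, variables = zip(*optimizer.compute_gradients(loss))
clipped, _ = tf.clip_by_global_norm(grads, clip_norm=5.0)   # cap the global gradient norm
train_op = optimizer.apply_gradients(zip(clipped, variables))
```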

dingchenghu avatar Aug 15 '19 21:08 dingchenghu

@iclavera Hi, thank you very much for your reply! We all think your paper is very novel and that it will be of great help to our further research. However, we are confused by the parameters and cannot match the parameters in the code with those in Appendix E. I think this is the main reason we cannot get good results, so could you explain more about this?

Also, I am confused that the loss function of GrBAL in the paper is different from the one in the code: the former is the mean of log p_θ' (Equation (4) in the paper), but the latter is the squared difference (here). Does that affect the result?

Thank you very much!
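One possible explanation I can think of (my guess, not confirmed by the authors): if the learned dynamics model is a Gaussian with fixed variance, maximizing the log-likelihood in Equation (4) is equivalent, up to an additive constant, to minimizing the squared error used in the code. A small numpy sketch:

```python
import numpy as np

# Sketch of my guess: for a fixed-variance Gaussian, the negative log-likelihood
# differs from the squared error only by a constant that does not depend on the
# prediction, so both objectives have the same minimizer.
rng = np.random.default_rng(0)
target = rng.standard_normal(5)     # stand-in for the true next state
pred = rng.standard_normal(5)       # stand-in for the predicted mean next state
sigma = 1.0                         # fixed-variance assumption

sq_error = 0.5 * np.sum((target - pred) ** 2)
neg_log_lik = sq_error / sigma**2 + 0.5 * len(target) * np.log(2 * np.pi * sigma**2)
print(neg_log_lik - sq_error)       # a constant, independent of pred
```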

morantumi avatar Aug 16 '19 14:08 morantumi

Update:

When training GrBAL with more timesteps, increased from 250000 to 1000000, it still doesn't work:

-----------------------------------------
| AverageDiscountedReturn | 9.08        |
| AverageForwardProgress  | 0.374       |
| AverageReturn           | -69.7       |
| AvgModelEpochTime       | 74.3        |
| EnvExecTime             | 1.12        |
| Epochs                  | 10          |
| Itr                     | 199         |
| ItrTime                 | 979         |
| MaxForwardProgress      | 0.503       |
| MaxReturn               | -56.2       |
| MinForwardProgress      | 0.159       |
| MinReturn               | -91.5       |
| NumTrajs                | 5           |
| PolicyExecTime          | 160         |
| Post-Loss               | 0.02958298  |
| Pre-Loss                | 0.032556403 |
| StdForwardProgress      | 0.116       |
| StdReturn               | 12          |
| Time                    | 7.72e+05    |
| Time-EnvSampleProc      | 0.000972    |
| Time-EnvSampling        | 162         |
| Time-ModelFit           | 817         |
| n_timesteps             | 1000000     |
-----------------------------------------

dingchenghu avatar Aug 22 '19 02:08 dingchenghu

Any updates? @iclavera

Thanks!

dingchenghu avatar Sep 07 '19 05:09 dingchenghu

@iclavera Hi, are there any updates? Hoping for your reply, thank you very much!

morantumi avatar Oct 06 '19 13:10 morantumi

Hi, looking forward to an update on this issue. Thanks!

rahulsiripurapu avatar Oct 13 '19 18:10 rahulsiripurapu

@iclavera any updates? The code doesn't seem to work.

laurinpaech avatar Jan 12 '20 10:01 laurinpaech

Hi @iclavera, could you kindly let me know how to run the code in MPPI mode? Is this mode available in the given repository? Also, I would appreciate some guidance on tuning the hyperparameters.

Kartick-rocks avatar Apr 17 '20 19:04 Kartick-rocks

Hi, thank you very much for your paper and code! We all think your paper is very novel and it will be of great help to our further research.
I have a problem: it shows "ERROR: Expired activation key Press Enter to exit ..". How can I solve it?

benchidefeng avatar Jul 01 '20 08:07 benchidefeng

Hello everyone,

Has anyone been able to reproduce the results? I would like to reproduce the GrBAL results, but I'm facing the same issues.

LucasAlegre avatar Sep 04 '20 21:09 LucasAlegre

@morantumi Hi! Did you get the expected results, or did you give up?

ArinoWang avatar May 10 '21 02:05 ArinoWang

Any updates? Has anyone successfully reproduced the results from the paper?

gunnxx avatar Oct 19 '21 04:10 gunnxx

(Replying to @dingchenghu's earlier comment reporting that run_grbal.py with the default settings cannot reproduce the original result after 250000 timesteps.)

In the default training of GrBAL, did you change the task from None to cripple? If you don't change it, the robot always runs in the same environment, which may lead to non-optimal results.
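Roughly what I mean, as an illustrative sketch only (I am not sure of the exact key name in run_scripts/run_grbal.py; please check your copy):

```python
# Illustrative sketch only: enable task variation so the dynamics change across rollouts.
# The exact config key in run_scripts/run_grbal.py may differ.
config = {
    # ... other keys unchanged ...
    'task': 'cripple',   # was None; with None the robot always runs in the same environment
}
```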

wawachen avatar Aug 12 '22 23:08 wawachen