
Discrete action_space

Open songanz opened this issue 4 years ago • 6 comments

Hi katerakelly, thank you very much for sharing the code for your paper. I think your approach is very promising.

Now I am trying to apply your method to my application, which has a discrete action space, so I may need to modify some of your interfaces. I have already made some changes to your NormalizedBoxEnv() class in wrappers.py so that it can pass through a discrete action space, and I am planning to revise your SAC. So my question is: can you give me some suggestions on how to revise your SAC? Is there anything I need to be careful of?
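For reference, a minimal sketch of the kind of wrapper change described above (illustrative only — this is not the actual NormalizedBoxEnv code, and the class/attribute names here are assumptions): normalize Box actions from [-1, 1] to the env's bounds, but pass discrete actions through untouched.

```python
import numpy as np

class NormalizedEnvSketch:
    """Hypothetical wrapper: rescales continuous actions, passes discrete ones through."""
    def __init__(self, env):
        self.env = env
        # assume the env exposes .action_space with an `n` attribute when discrete
        self.is_discrete = hasattr(env.action_space, "n")

    def step(self, action):
        if not self.is_discrete:
            # rescale action from [-1, 1] to the Box bounds [low, high]
            lb = self.env.action_space.low
            ub = self.env.action_space.high
            action = lb + (np.asarray(action) + 1.0) * 0.5 * (ub - lb)
            action = np.clip(action, lb, ub)
        return self.env.step(action)
```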

Also, could you please tell me how I can generate rollouts before adaptation during meta-testing, just to show the improvement?

songanz avatar Aug 06 '20 18:08 songanz

For discrete action spaces, you can simplify SAC in some ways, since now the soft Q-function can output Q-values over all actions for a given state rather than the value for a single (s, a) pair. This might be helpful to you: https://arxiv.org/pdf/1910.07207.pdf

To be honest, I might consider using the garage implementations of SAC and PEARL, here: https://github.com/rlworkgroup/garage. That version is benchmarked regularly, and the SAC there has been shown in some cases to perform better than the SAC here, which originates from rlkit. Their implementation is based on mine and reads quite similarly.
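A conceptual sketch of that simplification (numpy for clarity; function names are illustrative, not from this repo or garage): with discrete actions, the soft state value V(s) = E_{a~pi}[Q(s,a) - alpha * log pi(a|s)] and the policy objective can be computed as exact expectations over the Q-value vector, with no action sampling (cf. arXiv:1910.07207).

```python
import numpy as np

def soft_state_value(q_values, policy_probs, alpha):
    """Exact soft value: q_values, policy_probs have shape (batch, num_actions)."""
    log_probs = np.log(policy_probs + 1e-8)
    return np.sum(policy_probs * (q_values - alpha * log_probs), axis=-1)

def policy_loss(q_values, policy_probs, alpha):
    """SAC policy objective with an exact expectation over discrete actions:
    minimize E_pi[alpha * log pi(a|s) - Q(s, a)]."""
    log_probs = np.log(policy_probs + 1e-8)
    return np.mean(np.sum(policy_probs * (alpha * log_probs - q_values), axis=-1))
```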

To generate the pre-adaptation rollouts during meta-testing: this information is collected here: https://github.com/katerakelly/oyster/blob/master/rlkit/core/rl_algorithm.py#L457, which records the average return per adaptation rollout. The rollouts up to num_exp_traj_eval (default is 2) use z sampled from the prior, so they are pre-adaptation. You could save their returns separately as another metric.
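Saving those returns separately could look something like this (a hypothetical helper, not part of the repo — only the role of num_exp_traj_eval comes from the discussion above):

```python
import numpy as np

def split_adaptation_returns(returns_per_rollout, num_exp_traj_eval=2):
    """Split per-rollout returns into pre-adaptation (z from the prior) and
    post-adaptation (z from the inferred posterior) averages."""
    pre = returns_per_rollout[:num_exp_traj_eval]
    post = returns_per_rollout[num_exp_traj_eval:]
    pre_mean = float(np.mean(pre)) if pre else float("nan")
    post_mean = float(np.mean(post)) if post else float("nan")
    return pre_mean, post_mean
```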

katerakelly avatar Aug 06 '20 18:08 katerakelly

Thank you very much for your prompt reply. I have checked the garage repo, and from a first round of browsing it is hard to tell where they implement SAC with a discrete action space. Could you please give me a little more guidance? That repo is way too big for me.

songanz avatar Aug 07 '20 03:08 songanz

They don't implement SAC with discrete actions in garage either; you would have to modify it there as well. I just mentioned it in case it might be a better repo for you. The SAC implementation in that repo is here: https://github.com/rlworkgroup/garage/blob/master/src/garage/torch/algos/sac.py

katerakelly avatar Aug 07 '20 17:08 katerakelly

Since revising SAC to a discrete-action version would require too many changes to your algorithm, I just made my env use a continuous action space instead.

And I understand that the reason "the rollouts up to num_exp_traj_eval (default is 2) will be with z sampled from the prior, so will be pre-adaptation" is that, in your collect_paths function: https://github.com/katerakelly/oyster/blob/44e20fddf181d8ca3852bdf9b6927d6b8c6f48fc/rlkit/core/rl_algorithm.py#L361, the agent does not infer the posterior until num_exp_traj_eval paths have been collected. Thank you very much for helping me.
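In other words, the control flow is roughly the following (a paraphrased sketch, not the actual collect_paths code — the agent methods and rollout callback here are simplified stand-ins):

```python
def collect_paths_sketch(agent, rollout_fn, num_rollouts, num_exp_traj_eval=2):
    """Collect rollouts; z stays at the prior until num_exp_traj_eval paths exist."""
    paths = []
    agent.clear_z()  # start from the prior
    for i in range(num_rollouts):
        path = rollout_fn(agent)
        paths.append(path)
        # posterior inference only begins once enough exploration
        # trajectories have been collected
        if i + 1 >= num_exp_traj_eval:
            agent.infer_posterior(agent.context)
    return paths
```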

songanz avatar Aug 07 '20 22:08 songanz

Sorry to bother you again. Could you please tell me the definition of these three numbers in the process.csv? [screenshot] Another question: in the online_train_epoch file I have 3 columns. The first column is before adaptation (the default num_exp_traj_eval is actually 1). What are the other 2 columns, and what parameter does the number 2 correspond to?

songanz avatar Aug 11 '20 23:08 songanz

Hi, sorry, I think I never saw that you reopened this! See this issue for the definitions of these metrics: https://github.com/katerakelly/oyster/issues/27

katerakelly avatar Aug 16 '21 21:08 katerakelly