Discrete action_space
Hi katerakelly, thank you very much for sharing the code for your paper. I think your approach is very promising.
Now I am trying to apply your method to my application, which has a discrete action space, so I may need to modify some of your interfaces. I have already made some changes to your NormalizedBoxEnv class in wrapper.py so that it can pass through a discrete action space (a sketch of what I mean is below), and I am planning to revise your SAC. So my question is: can you give some suggestions on how to revise your SAC? Is there anything I need to be careful of?
Also, could you please tell me how I can generate rollouts before adaptation during meta-testing, just to show the improvement?
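For reference, the kind of change I made to the wrapper looks roughly like this (DiscretePassthroughEnv is just an illustrative name for the sketch, not my actual code):

```python
import gym

class DiscretePassthroughEnv(gym.Wrapper):
    """Unlike NormalizedBoxEnv, which rescales Box actions into [-1, 1],
    a Discrete action space needs no normalization and is passed through."""
    def __init__(self, env):
        super().__init__(env)
        assert isinstance(env.action_space, gym.spaces.Discrete)

    def step(self, action):
        # Forward the integer action unchanged; no rescaling is needed.
        return self.env.step(action)
```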
For discrete action spaces, you can simplify SAC in some ways, since the soft Q-function can now output Q-values over all actions for a given state rather than the value for a single (s, a) pair. This might be helpful to you: https://arxiv.org/pdf/1910.07207.pdf (a rough sketch of the idea is below).

To be honest, I might consider using the garage implementation of SAC and PEARL, here: https://github.com/rlworkgroup/garage That version is benchmarked regularly, and the SAC there has been shown in some cases to perform better than the SAC here, which originates from rlkit. Their implementation is based on mine and reads quite similarly.
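To make the discrete-SAC idea concrete, here is a rough sketch following the paper linked above (illustrative only, not code from this repo or from garage):

```python
import torch.nn as nn
import torch.nn.functional as F

class DiscreteQNetwork(nn.Module):
    """Outputs Q(s, a) for every action at once: shape (batch, n_actions)."""
    def __init__(self, obs_dim, n_actions, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)

def discrete_sac_policy_loss(policy_logits, q_values, alpha):
    """Policy objective in the style of arXiv:1910.07207: the expectation
    over actions is computed exactly from the categorical policy, so no
    reparameterization trick is needed."""
    log_pi = F.log_softmax(policy_logits, dim=-1)  # log pi(a|s), (batch, n_actions)
    pi = log_pi.exp()
    # E_{a ~ pi}[ alpha * log pi(a|s) - Q(s, a) ], summed analytically over actions
    return (pi * (alpha * log_pi - q_values)).sum(dim=-1).mean()
```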
To generate the pre-adaptation rollouts during meta-testing: this information is collected here: https://github.com/katerakelly/oyster/blob/master/rlkit/core/rl_algorithm.py#L457 which records the average return per adaptation rollout. The rollouts up to num_exp_traj_eval (default is 2) are run with z sampled from the prior, so they are pre-adaptation. You could save their returns separately as another metric.
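For example, something like this hypothetical helper (not code in the repo) applied to the per-rollout returns collected there:

```python
def split_adaptation_returns(per_rollout_returns, num_exp_traj_eval=2):
    """Hypothetical helper: split per-rollout returns into pre-adaptation
    (z sampled from the prior) and post-adaptation (z from the posterior)."""
    pre = per_rollout_returns[:num_exp_traj_eval]
    post = per_rollout_returns[num_exp_traj_eval:]
    mean = lambda xs: sum(xs) / len(xs) if xs else float('nan')
    return mean(pre), mean(post)
```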
Thank you very much for your prompt reply. I have checked the garage repo, and from a first round of browsing it is hard to tell where they implement SAC with a discrete action space. Could you please give me a bit more guidance? That repo is way too big for me.
They don't implement SAC with discrete actions in garage; you would have to modify it there as well. I just mentioned it in case it might be a better repo for you. The SAC implementation in that repo is here: https://github.com/rlworkgroup/garage/blob/master/src/garage/torch/algos/sac.py
Since revising SAC into a discrete-action version would require too many changes to your algorithm, I just gave my environment a continuous action space.
And I understand that the reason "the rollouts up to num_exp_traj_eval (default is 2) will be with z sampled from the prior, so will be pre-adaptation" is that in your collect_paths function, https://github.com/katerakelly/oyster/blob/44e20fddf181d8ca3852bdf9b6927d6b8c6f48fc/rlkit/core/rl_algorithm.py#L361, the agent does not infer the posterior until num_exp_traj_eval paths have been collected. Thank you very much for helping me.
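In other words, as I read it, the loop looks roughly like this (my paraphrase of collect_paths, not the actual code):

```python
def collect_paths_sketch(agent, sampler, num_steps_per_eval, num_exp_traj_eval):
    """Paraphrase of collect_paths: the first num_exp_traj_eval trajectories
    run with z sampled from the prior; only afterwards is the posterior
    inferred from the accumulated context."""
    paths, num_transitions, num_trajs = [], 0, 0
    agent.clear_z()  # reset z to a sample from the prior
    while num_transitions < num_steps_per_eval:
        path, n = sampler.obtain_samples(max_trajs=1, accum_context=True)
        paths += path
        num_transitions += n
        num_trajs += 1
        if num_trajs >= num_exp_traj_eval:
            agent.infer_posterior(agent.context)  # switch z to the posterior
    return paths
```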
Sorry to bother you again. Could you please tell me the definitions of these three numbers in progress.csv?
And another question: in the online_train_epoch file, I have 3 columns. The first column is before adaptation (the default num_exp_traj_eval is actually 1). What are the other 2 columns? And which parameter does the number 2 correspond to?
Hi, sorry, I think I never saw that you reopened this! See this issue for the definitions of these metrics: https://github.com/katerakelly/oyster/issues/27