tensorforce
[not issue] How to implement Soft Actor-Critic?
Hi @AlexKuhnle, sorry for bothering you; I would like to implement the SAC algorithm, and I'm wondering if you have some suggestions for that.
In particular, I have some doubts about the following:
- SAC is off-policy, so a `replay` memory should account for this, right?
- How to include the entropy term as part of the objective? Could `entropy_regularization` do the job? If not, is an `entropy` objective necessary, and how could I implement it? With such a term, should the maximum entropy objective be implemented like `Plus(Value(value='action'), Entropy(...))`?
- The paper says, in section 4.2, that to learn the policy parameters they use the reparametrization trick instead of the likelihood ratio gradient estimator. According to OpenAI Spinning Up, this could be implemented by a squashed Gaussian policy, i.e. `tanh(mu(s) + sigma(s) * noise)`, where `noise` is sampled from a spherical Gaussian, i.e. `N(0, 1)`, and `mu(s)` and `sigma(s)` are outputs of a neural network that takes the states as input (see the sketch after this list). I think that a `gaussian` distribution is not enough for this; maybe, at best, a function approximator here instead of a `linear` one could do the job.
- Could the stuff from point 3 be replaced by a ratio-based `policy_gradient` + `entropy` objective instead?
- SAC uses two Q-functions; to replicate this, one should do something like this, right?
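For reference, a minimal sketch (plain NumPy, not Tensorforce code) of the squashed-Gaussian sampling with the reparametrization trick mentioned in the third point; `mu` and `log_sigma` are placeholders standing in for the outputs of the policy network:

```python
import numpy as np

def squashed_gaussian_sample(mu, log_sigma):
    # Reparametrization trick: sample fixed spherical noise and transform it,
    # so the resulting action stays differentiable w.r.t. mu and log_sigma
    # (written in NumPy here just to show the computation).
    noise = np.random.normal(size=np.shape(mu))  # noise ~ N(0, 1)
    pre_tanh = mu + np.exp(log_sigma) * noise    # Gaussian sample via reparametrization
    return np.tanh(pre_tanh)                     # squash into (-1, 1)

# Example: a 2-dimensional action with zero mean and unit standard deviation
action = squashed_gaussian_sample(mu=np.zeros(2), log_sigma=np.zeros(2))
```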
I'm not trying to do an exact replica of SAC, so it won't be an issue if things are not completely the same. Thanks in advance for the help.
Hey @Luca96, it would be great to include SAC in the framework, as it's a popular algorithm. I will need to have a look at the paper to answer these questions more definitively, but:
- Yes, I think that's basically what SAC does. On- vs off-policy is not really specified that way in Tensorforce, but it wouldn't make much sense to use a replay memory for an on-policy algorithm. Can say more later.
- I think what SAC does with entropy is basically the same as what `entropy_regularization` does, yes.
- Tensorforce's Gaussian behaves differently from what SAC uses, for sure. However, the way to go would be to implement a separate `Distribution` class similar to `Gaussian` (or add an argument to the existing implementation; depending on how much changes, I can't say right now). I don't think you need more than a `linear` layer, though, since the distribution is only the "final layer" following the policy network, so overall it is a proper NN function approximator.
- That would definitely be my goal for the modularity of a SAC implementation. Changing this would probably substantially change SAC (at least if one wants to stay reasonably close to the paper); however, in Tensorforce SAC should just be one configuration of the `TensorforceAgent` (see the configuration sketch after this list), and another SAC-like one could well be what you describe.
- That's the most annoying part about SAC w.r.t. Tensorforce, which currently more or less fixes the idea of having a policy network and at most one baseline/critic/value/target network. However, SAC uses a policy, value and target network. This needs a bit of thinking/planning, but shouldn't be a deal-breaker. On the other hand, it may not be all that important to have a target in addition to a value network; a value network should do an alright job, too.
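To make "one configuration of the `TensorforceAgent`" concrete, here is a rough, non-authoritative sketch of what an off-policy, entropy-regularized configuration in the spirit of SAC could look like. The state/action specs and all hyperparameter values are placeholders, and the argument names should be checked against the current `TensorforceAgent` documentation:

```python
from tensorforce import Agent

# Rough sketch only: an off-policy, entropy-regularized configuration in the
# spirit of SAC, not a faithful reproduction of the paper (no twin Q-functions,
# no target network). All values below are placeholders.
agent = Agent.create(
    agent='tensorforce',
    states=dict(type='float', shape=(8,)),        # example state spec
    actions=dict(type='float', shape=(2,), min_value=-1.0, max_value=1.0),
    max_episode_timesteps=500,
    policy=dict(network='auto'),                  # Gaussian distribution for float actions
    memory=dict(type='replay', capacity=100000),  # off-policy replay memory
    update=dict(unit='timesteps', batch_size=256),
    optimizer=dict(type='adam', learning_rate=3e-4),
    objective='policy_gradient',
    reward_estimation=dict(horizon=1, discount=0.99),
    entropy_regularization=0.01                   # entropy term added to the objective
)
```

The combination of a `replay` memory with `entropy_regularization` covers the off-policy and entropy aspects discussed above, but not the two Q-functions or the target network from the paper.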
I would be happy to explore this further, however, I would definitely want to do the final implementation in the new version (branch tf2). Of course you can start experimenting on master.
FYI: I have had discussions with several people who believe that SAC rather than PPO is the current state-of-the-art for continuous DRL at the moment (whether they are right or wrong is another point), and who are starting to consider not having SAC in tensorforce a deal breaker. So I believe that it would definitely be very interesting to 1) implement it, and 2) test it vs. PPO a bit in the fashion of Fig. 1 of the initial paper https://arxiv.org/pdf/1801.01290.pdf .
That would be great, yes. I will probably struggle to run a lot of benchmarks, so input for this is generally very welcome. And I agree that SAC would really be good to have around -- whether they are right or wrong (I'm not so sure :-).
@Luca96, have you started implementing something? The tf2 branch is in a "usable" state now, I would say (some things, like saving/summaries, don't work yet, but most things do), and I would be happy to look into this soon.
Not yet, unfortunately. I'm quite busy with exams, but it's good to know that the tf2 branch is usable now. I'll give a try as soon as I can.
Hello @AlexKuhnle, I was just curious if there were any plans on adding SAC in the near future. Thank you for your awesome work on the library!
Hi @p-margitfalvi , I will check what's missing to support SAC, and maybe it can be added relatively easily -- after all, it's been on the list and of repeated interest for a while. The reason why it's not available yet is the entropy-component of its policy gradient formulation, which is a quite "deep" modification, although I haven't checked for a while how much work it would actually be, so will do.
@p-margitfalvi , I've looked into it again, and these are the features currently missing, with an idea of what each would take:
(a) "Maximum entropy RL", i.e. entropy return component: should be relatively straightforward (b) "Soft" value function more along the lines of preceding work: implemented to some degree, can be done (c) Two value functions Q and V: requires extension/modification of Tensorforce architecture, and since I wouldn't want to just hack it in, this will take a bit more time (d) Two copies of Q, target copy of V: Further extensions on top of (c), again more work (note: overall SAC uses 5 networks, whereas Tensorforce's internal architecture is so far based on two networks, roughly "policy" and "value function")
I think that (a) and (b) would enable an agent type qualifying as "soft actor-critic", however, it wouldn't be equivalent to the one presented in the paper, which uses additional and "SAC-unrelated" modifications (c) and (d). So depending on what you would be happy with, the answer is either "can happen soon" or "won't happen very soon"... :-) Hope that gives an idea.
Thanks for looking into it @AlexKuhnle, appreciate it. I believe for now an implementation of (a) and (b) might be sufficient for my application; it's worth a shot at least.