
[not issue] How to implement Soft Actor-Critic?

Open Luca96 opened this issue 5 years ago • 9 comments

Hi @AlexKuhnle, sorry for bothering you; I would like to implement the SAC algorithm, and I'm wondering if you have some suggestions for that.

In particular, I have some doubts about the following:

  1. SAC is off-policy, so a replay memory should be used to account for this, right?
  2. How do I include the entropy term as part of the objective? Could entropy_regularization do the job? If not, is a separate entropy objective necessary, and how can I implement it? With such a term, would the maximum-entropy objective be implemented like Plus(Value(value='action'), Entropy(...))?
  3. The paper says, in Section 4.2, that to learn the policy parameters they use the reparametrization trick instead of the likelihood-ratio gradient estimator. According to OpenAI Spinning Up, this can be implemented with a squashed Gaussian policy, i.e. tanh(mu(s) + sigma(s) * noise), where noise is sampled from a spherical Gaussian N(0, 1), and mu(s) and sigma(s) are the outputs of a neural network that takes the state s as input (see the sketch after this list). I think a plain Gaussian distribution is not enough for this; maybe a function approximator here instead of a linear one could do the job.
  4. Could the approach from point 3 be replaced by a ratio-based policy_gradient + entropy objective instead?
  5. SAC uses two Q-functions; to replicate this, one should do something like this, right?
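
For reference on point 3, here is a minimal TF2 sketch of what the squashed Gaussian with the reparametrization trick computes. This is not Tensorforce code: `mean` and `log_std` stand for the outputs of a policy network, and the `1e-6` epsilon is just an arbitrary numerical guard.

```python
# Illustrative sketch of a squashed-Gaussian sample with the reparametrization
# trick (as described in the SAC paper / OpenAI Spinning Up), NOT Tensorforce code.
import numpy as np
import tensorflow as tf

def squashed_gaussian_sample(mean, log_std):
    std = tf.exp(log_std)
    noise = tf.random.normal(tf.shape(mean))   # spherical N(0, 1)
    pre_tanh = mean + std * noise              # reparametrized sample
    action = tf.tanh(pre_tanh)                 # squash into (-1, 1)

    # Diagonal-Gaussian log-density of the pre-tanh sample ...
    log_prob = -0.5 * tf.reduce_sum(
        ((pre_tanh - mean) / std) ** 2 + 2.0 * log_std + np.log(2.0 * np.pi),
        axis=-1)
    # ... corrected for the tanh squashing (change of variables).
    log_prob -= tf.reduce_sum(
        tf.math.log(1.0 - tf.square(action) + 1e-6), axis=-1)
    return action, log_prob
```

Because the noise is sampled independently of the network outputs, gradients flow through `mean` and `log_std`, which is exactly the point of the reparametrization trick.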

I'm not trying to do an exact replica of SAC, so it won't be an issue if things are not completely the same. Thanks in advance for the help.

Luca96 · May 03 '20 13:05

Hey @Luca96, it would be great to include SAC in the framework, as it's a popular algorithm. I will need to have a look at the paper to answer these questions more definitively, but:

  1. Yes, I think that's basically what SAC does. On- vs off-policy is not really specified that way in Tensorforce, but it wouldn't make much sense to use a replay memory for an on-policy algorithm. Can say more later.
  2. I think what SAC does with entropy is basically the same as what entropy_regularization does, yes.
  3. Tensorforce's Gaussian behaves differently from what SAC uses, for sure. However, the way to go would be to implement a separate Distribution class similar to Gaussian (or to add an argument to the existing implementation; how much changes, I can't say right now). I don't think you need more than a linear layer, though, since the distribution is only the "final layer" following the policy network, so overall you still get a proper NN function approximator.
  4. That would definitely be my goal for the modularity of a SAC implementation. Changing this would probably change SAC substantially (at least if one wants to stay reasonably close to the paper). However, in Tensorforce SAC should just be one configuration of the TensorforceAgent, and another SAC-like agent could well be what you describe.
  5. That's the most annoying part about SAC w.r.t. Tensorforce, which currently more or less fixes the architecture to a policy network and at most one baseline/critic/value/target network. However, SAC uses a policy, a value and a target network (see the sketch below). Needs a bit of thinking/planning, but shouldn't be a deal-breaker. On the other hand, it may not be all that important to have a target in addition to a value network; a value network should do an alright job, too.
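
To make point 5 concrete, here is a sketch of the "soft" twin-Q target, independent of Tensorforce's architecture. Note it follows the later, commonly used formulation (e.g. in Spinning Up) that drops the separate V network and instead keeps target copies of the two Q critics; `policy`, `q1_target` and `q2_target` are hypothetical callables, and `alpha`/`gamma` values are arbitrary.

```python
# Entropy-regularized ("soft") Bellman target with clipped double-Q,
# sketched under the assumptions stated above; not Tensorforce internals.
import tensorflow as tf

def soft_q_target(reward, next_state, done, policy, q1_target, q2_target,
                  alpha=0.2, gamma=0.99):
    # Next action and its log-probability from the current policy
    # (e.g. the squashed Gaussian sketched earlier in this thread).
    next_action, next_log_prob = policy(next_state)
    # Clipped double-Q: element-wise minimum of the two target critics.
    min_q = tf.minimum(q1_target(next_state, next_action),
                       q2_target(next_state, next_action))
    # Soft Bellman backup: bootstrap value minus the entropy penalty.
    return reward + gamma * (1.0 - done) * (min_q - alpha * next_log_prob)
```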

I would be happy to explore this further, however, I would definitely want to do the final implementation in the new version (branch tf2). Of course you can start experimenting on master.

AlexKuhnle · May 03 '20 21:05

FYI: I have had discussions with several people who believe that SAC rather than PPO is the current state of the art for continuous DRL (whether they are right or wrong is another point), and who are starting to consider not having SAC in Tensorforce a deal breaker. So I believe it would definitely be very interesting to 1) implement it, and 2) test it vs. PPO a bit in the fashion of Fig. 1 of the initial paper https://arxiv.org/pdf/1801.01290.pdf .

jerabaul29 · May 05 '20 12:05

That would be great, yes. I will probably struggle to run a lot of benchmarks, so input for this is generally very welcome. And I agree that SAC would really be good to have around -- whether they are right or wrong (I'm not so sure :-).

AlexKuhnle · May 05 '20 13:05

@Luca96, have you started implementing something? The tf2 branch is in a "usable" state now, I would say (some things, like saving/summaries, don't work yet, but most things do), and I would be happy to look into this soon.

AlexKuhnle · May 10 '20 19:05

Not yet, unfortunately. I'm quite busy with exams, but it's good to know that the tf2 branch is usable now. I'll give it a try as soon as I can.

Luca96 · May 10 '20 20:05

Hello @AlexKuhnle, I was just curious if there were any plans on adding SAC in the near future. Thank you for your awesome work on the library!

p-margitfalvi · Dec 17 '20 21:12

Hi @p-margitfalvi, I will check what's missing to support SAC, and maybe it can be added relatively easily -- after all, it's been on the list and of repeated interest for a while. The reason it's not available yet is the entropy component of its policy-gradient formulation, which is quite a "deep" modification, although I haven't checked for a while how much work it would actually be, so I will do that.

AlexKuhnle · Dec 19 '20 11:12

@p-margitfalvi, I've looked into it again, and these are the features currently missing, along with an idea of what each would take:

(a) "Maximum entropy RL", i.e. entropy return component: should be relatively straightforward (b) "Soft" value function more along the lines of preceding work: implemented to some degree, can be done (c) Two value functions Q and V: requires extension/modification of Tensorforce architecture, and since I wouldn't want to just hack it in, this will take a bit more time (d) Two copies of Q, target copy of V: Further extensions on top of (c), again more work (note: overall SAC uses 5 networks, whereas Tensorforce's internal architecture is so far based on two networks, roughly "policy" and "value function")

I think that (a) and (b) would enable an agent type qualifying as "soft actor-critic", however, it wouldn't be equivalent to the one presented in the paper, which uses additional and "SAC-unrelated" modifications (c) and (d). So depending on what you would be happy with, the answer is either "can happen soon" or "won't happen very soon"... :-) Hope that gives an idea.
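
As a rough illustration of what (a), the entropy return component, amounts to, here is a minimal NumPy sketch: the policy's entropy (estimated as -log pi(a_t|s_t)) is added as a bonus to the environment reward before discounted returns are computed. The `alpha` temperature, the Monte-Carlo return computation, and the function name are assumptions for illustration only, not Tensorforce's actual internals.

```python
# Sketch of an entropy-augmented ("soft") return, under the assumptions above.
import numpy as np

def entropy_augmented_returns(rewards, log_probs, alpha=0.2, gamma=0.99):
    # "Soft" reward: r_t + alpha * H(pi(.|s_t)), with the entropy estimated
    # from the sampled action as -log pi(a_t | s_t).
    soft_rewards = (np.asarray(rewards, dtype=float)
                    - alpha * np.asarray(log_probs, dtype=float))
    # Standard discounted return computed over the soft rewards.
    returns = np.zeros_like(soft_rewards)
    running = 0.0
    for t in reversed(range(len(soft_rewards))):
        running = soft_rewards[t] + gamma * running
        returns[t] = running
    return returns
```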

AlexKuhnle · Dec 22 '20 10:12

Thanks for looking into it @AlexKuhnle, appreciate it. I believe that, for now, implementing (a) and (b) might be sufficient for my application; it's worth a shot at least.

p-margitfalvi · Dec 22 '20 16:12