stable-baselines3-contrib

[Feature request] Implement OT-TRPO

Open antonioterpin opened this issue 1 year ago • 3 comments

Hi,

We developed and tested our algorithm OT-TRPO (to appear at NeurIPS 2022; you can find the preprint here) using stable baselines.

Is there an interest in integrating it with the existing package? We would be happy to discuss the implementation and integration of such a feature. You can find the available code here. It should need only minor effort to (i) integrate the continuous-spaces implementation with stable baselines, and (ii) automatically switch between the discrete and continuous cases with the same interface. We already have tuned hyperparameters for many environments (found with the hyperparameter tuning functionality of stable baselines).
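
For illustration, here is a minimal sketch of the interface we have in mind, using the existing sb3-contrib TRPO as a stand-in; an eventual OTTRPO class (hypothetical name) would be a drop-in replacement and would pick the discrete or continuous OT machinery internally based on the action space:

```python
from gym import spaces
from sb3_contrib import TRPO  # an OTTRPO class (hypothetical) would mirror this interface
from stable_baselines3.common.env_util import make_vec_env

env = make_vec_env("Pendulum-v1", n_envs=4)  # Box (continuous) action space
# env = make_vec_env("CartPole-v1", n_envs=4)  # Discrete action space, same code path

# The algorithm could dispatch internally on the action-space type:
print("discrete actions:", isinstance(env.action_space, spaces.Discrete))

model = TRPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=50_000)

obs = env.reset()
action, _ = model.predict(obs, deterministic=True)
```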

Cheers, Antonio

antonioterpin · Nov 14 '22 12:11

Hello, thanks for the proposal. I will try to have a deeper look later.

Anyway, after a quick look at it, I have several questions:

  • What are the main advantages/disadvantages vs PPO/TRPO?

  • What is the runtime compared to PPO/TRPO?

  • It looks like you are using a different set of hyperparameters per environment. Are there any default hyperparameters that work on many problems without extra tuning?

  • What hyperparameters did you use for TRPO/PPO in Figure 1? It looks like TRPO is under-tuned, at least for Swimmer.

  • Do you have results on PyBullet envs (or at least on more MuJoCo envs)?

  • According to the paper, you used the TRPO/PPO/A2C implementations from SB2, not SB3; is there a reason? (Timeouts are not properly handled in SB2, which will have an impact on performance.)

Side note: sde_sample_freq: 16 does nothing if use_sde=False (the default), which seems to be the case here...
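
For reference, a minimal example of how those two settings interact in SB3 (here with PPO; sde_sample_freq is only read when use_sde=True):

```python
from stable_baselines3 import PPO

# With use_sde=False (the default), sde_sample_freq is silently ignored.
model = PPO("MlpPolicy", "Pendulum-v1", sde_sample_freq=16)  # no effect

# sde_sample_freq only matters when generalized State-Dependent Exploration is enabled:
model = PPO("MlpPolicy", "Pendulum-v1", use_sde=True, sde_sample_freq=16)
```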

araffin · Nov 16 '22 12:11

Hi,

thank you for your answer!

  1. Two of the key features of optimal transport discrepancies are the following (please see the paper for more; a small numerical illustration of the first point is sketched after this list):
  • they allow us to compare probability measures (and thus policies) that do not share the same support (for which the KL divergence is infinite);
  • they encapsulate the geometry encoded by the transport cost in the action space: the discrepancy between two actions coincides with the discrepancy between the corresponding deterministic policies (whereas the KL divergence is again infinite).
  2. These are reported in Table 5. In some instances, OT-TRPO turns out to be slower than both; in others it is faster than TRPO but slower than PPO. However, ours is research code: we did not spend a lot of time optimizing it, although we did use the stable-baselines implementations of TRPO/PPO. In fact, one of the goals of integrating our code into stable baselines is to reach a “production-ready” version. Moreover, as mentioned in the paper, it is not a completely fair comparison. The theoretical benefits enabled by optimal transport discrepancies compared to the KL divergence come at the price of a “much harder” constraint in the trust region. A fairer comparison is then one against other methods using similarly “high-quality” notions of closeness, e.g. BGPG and WNPG, which are much slower than OT-TRPO.

  3. We have not investigated this question much, but we can certainly try. Is this the case with TRPO/PPO? As far as we know, they also have a different set of hyperparameters for each of the discussed environments. This is definitely something to look into, and one for which it would be great to have community-reviewed, optimized code.

  4. We never tuned TRPO/PPO; we used the values in the repo mentioned in the paper. We acknowledge the results might need to be updated if better hyperparameters have been found between the time we wrote the paper and today. Nonetheless, our results are in line with what we could find online (e.g., see https://spinningup.openai.com/en/latest/spinningup/bench.html for Swimmer). Finally, it is worth mentioning that we believe the stable-baselines community also plays a big role when it comes to finding better hyperparameters, optimizing the code, and spotting bugs. We did not spend much time optimizing our hyperparameters either; our main focus was to provide a class of theoretically justified algorithms. We are also looking forward to seeing what the community can achieve by selecting specific transportation costs, etc. In this sense, our experimental results are more of a proof of concept. We foresee much better results once a “production-ready” version of the code is available (hence the request to start this integration, with code review, etc.).

  5. Not yet. We have results for some MuJoCo environments, but we would be really happy to extend the studies to more environments.

  6. No particular reason; we were working with that version when the project started. Also, if such issues affect performance, we admittedly did not look into them for our algorithm. Our implementation at the moment is a proof of concept, and hence leaves room for a wide range of improvements.
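
To illustrate the first point above with a toy example (a sanity check using scipy, not part of our implementation): two deterministic policies that place all their mass on different actions have infinite KL divergence, while their 1-Wasserstein distance is finite and equals the distance between the actions themselves.

```python
import numpy as np
from scipy.stats import entropy, wasserstein_distance

# Two deterministic policies on a 1D action space:
# one always plays action 0.0, the other always plays action 0.5.
p = np.array([1.0, 0.0])  # all mass on action 0.0
q = np.array([0.0, 1.0])  # all mass on action 0.5

# KL divergence is infinite because the supports are disjoint.
print(entropy(p, q))  # -> inf

# The 1-Wasserstein distance is finite and equals the distance between
# the two actions (0.5), reflecting the geometry of the action space.
print(wasserstein_distance([0.0], [0.5]))  # -> 0.5
```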

Please let us know if we can clarify any other points. In general, though, implementation/performance-related questions are something we would like to look at together with you and the community, as we did not have the capacity to push too much on the software engineering side, and we believe our results can be further improved.

Thank you! All the best, Antonio

PS: Your side note is the kind of polishing we look forward to :)

antonioterpin · Nov 17 '22 20:11

We have not investigated this question much, but we can certainly try. Is this the case with TRPO/PPO?

Yes, there are good defaults that work reasonably well on a wide range of problems (they are not optimal, but they do work).
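
For example, PPO can typically be run with its constructor defaults, without any per-environment tuning:

```python
from stable_baselines3 import PPO

# No environment-specific tuning: constructor defaults only.
model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=50_000)
```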

Nonetheless, our results are in line with what we could find on the internet (e.g., see https://spinningup.openai.com/en/latest/spinningup/bench.html for Swimmer).

It is known that most Swimmer results are under-tuned (see https://arxiv.org/abs/2208.07587); for example, it is possible to get a 300+ score in 20k steps: https://twitter.com/araffin2/status/1582452208033214464

In this sense, our experimental results are more proof of concept.

Then I guess, as for REDQ, I would prefer to wait a bit (for instance, I waited until DroQ came out, which is a practical extension of REDQ). In the meantime, you can add your repo to the SB3 project list that we have in our doc (please open a PR for that).

And feel free to share new experimental results if you have some.

araffin · Dec 09 '22 11:12