
Adding Hierarchical RL Algorithms

DavidSlayback opened this issue 3 years ago • 4 comments

Hi, I'm a PhD student doing work in hierarchical reinforcement learning (specifically Option-critic-based algorithms), and I've found this repository to be a particularly helpful starting point when trying to prototype my algorithms!

I was wondering if there was any interest in having option-critic (or other HRL) baselines in this repository? I have a few in a private repo that I could convert to fit the structure here. To be upfront, there aren't great reference implementations for most of them, so while I started from a few official baselines, I've had to resolve conflicts between them by turning to the papers and my own theoretical work.

I wanted to check before making a pull request, since it seems like you have a strong focus on a curated set of easily understood reference implementations, and HRL baselines may be a bit far afield.

DavidSlayback avatar Aug 02 '22 13:08 DavidSlayback

Hi David, thanks for considering making a contribution. We would definitely be interested in having HRL algorithms. Please check out our contribution guide.

The main things I am looking for are 1) single file implementations (minimal lines of code), 2) documentation explaining notable implementation details, and 3) benchmarking and matching the performance of reference implementations.

Thanks again!

vwxyzjn avatar Aug 02 '22 15:08 vwxyzjn

CC @kinalmehta, who is working on DIAYN #267.

vwxyzjn avatar Aug 28 '22 02:08 vwxyzjn

Sorry for not giving any updates for a while. I'm running into a couple of issues following the guidelines:

  1. The official reference implementations (4, all based on OpenAI Baselines PPO) differ from each other (and from the papers) in significant ways. I have a version that follows the theoretical work correctly (at least as well as I can verify) in both JAX and PyTorch, but obviously that doesn't necessarily match the performance of any official implementation. (A rough sketch of the structure they all share is below this list.)

Obviously the big focus of this repo is on clean, reproducible scripts that match the implementation details, not necessarily the theory. How would you prefer I handle the differences? And how should I do a 1-1 comparison to the old implementations in a benchmark?

  2. The benchmark environments aren't included in any standard gym package. They use a lot of old Mujoco and gym-miniworld environments with mid-training transfers (which haven't been updated for many years). Using them directly would require a weird set of dependencies; alternatively, I could probably reimplement them. Preferences?
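For reference, the skeleton that all of these implementations (and my own JAX/PyTorch versions) share looks roughly like the sketch below; the divergences are in how the pieces are trained and regularized, not in this structure. The class name, layer sizes, and shapes are my own placeholders rather than anything copied from an official repo.

```python
# Rough sketch only: the common option-critic structure over a shared body, with
# Q(s, w) over options, per-option termination probabilities beta_w(s), and
# per-option intra-option policies pi_w(a|s). Names and sizes are placeholders.
import torch
import torch.nn as nn


class OptionCriticHeads(nn.Module):
    def __init__(self, obs_dim: int, num_actions: int, num_options: int, hidden: int = 64):
        super().__init__()
        self.num_options, self.num_actions = num_options, num_actions
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.q_options = nn.Linear(hidden, num_options)       # Q(s, w), also used to pick options
        self.termination = nn.Linear(hidden, num_options)     # logits of beta_w(s)
        self.option_policies = nn.Linear(hidden, num_options * num_actions)  # logits of pi_w(a|s)

    def forward(self, obs: torch.Tensor):
        h = self.body(obs)
        q = self.q_options(h)                                  # (batch, n_options)
        beta = torch.sigmoid(self.termination(h))              # (batch, n_options)
        pi_logits = self.option_policies(h).view(-1, self.num_options, self.num_actions)
        return q, beta, pi_logits
```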

DavidSlayback avatar Aug 29 '22 19:08 DavidSlayback

Hi @DavidSlayback

Based on the repos you shared, they seem to be using very old mujoco versions, but I guess the environments should be available in the latest gym versions.

Regarding the environments used here, such as HalfCheetahDir: they seem to be custom environments available from here. If possible, it would be better to update these environments to the latest dependencies and then work with those.

Can you list all the dependencies and environments causing problems, and then check what alternatives are available in the latest versions of those libraries? Where custom environments are used, they can be updated to work with the latest versions as well.
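For instance, if the HalfCheetahDir-style environments mostly just flip the sign of the forward-progress reward, something along these lines on top of the current gym HalfCheetah might be enough (this is only an assumption on my part; the wrapper name is a placeholder and the info keys should be checked against your environments):

```python
# Assumed sketch: approximate a HalfCheetahDir-style task on top of gym's
# HalfCheetah-v3 by flipping the sign of the forward reward. HalfCheetah-v3
# reports "reward_run" / "reward_ctrl" in info; recompute from
# info["x_velocity"] if your gym version differs.
import gym


class HalfCheetahDirWrapper(gym.Wrapper):
    def __init__(self, env, direction: float = 1.0):
        super().__init__(env)
        self.direction = direction  # +1.0 run forward, -1.0 run backward

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        run = info.get("reward_run", 0.0)
        ctrl = info.get("reward_ctrl", 0.0)
        reward = self.direction * run + ctrl
        return obs, reward, done, info

    def set_direction(self, direction: float):
        # mid-training transfer: flip the target running direction
        self.direction = direction


env = HalfCheetahDirWrapper(gym.make("HalfCheetah-v3"), direction=1.0)
```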

Regarding comparable performance: all the repos seem to be using Python 3.6 and really old gym and mujoco requirements. So even if a direct comparison is not feasible, maybe getting a score close to the ones mentioned in the paper should be good enough. Also, analyzing trained agents as done in the original papers should be a good sanity check for your implementations.

I'm facing a similar challenge and trying to work with pybullet instead of mujoco, but plan to move to the latest mujoco environments. For comparison, I plan on doing analyses similar to the ones done in the paper.

kinalmehta avatar Aug 30 '22 02:08 kinalmehta

Hi @DavidSlayback, I apologize for getting back to you so late. I am a little confused: there seem to be 4 algorithms in the hyperlinks. Which are the ones that you are interested in reproducing / contributing?

Regarding the environment versions, this is unfortunately one of the really tricky issues, and there is not really a great solution here. My thoughts are that you should try reproducing it as best as you can according to the paper and see if you can match similar ballpark performance; if not, then you can investigate implementation details and their impact on the performance.

vwxyzjn avatar Sep 27 '22 20:09 vwxyzjn

@vwxyzjn Yeah, sorry, I linked those implementations to show how much even the basics of option-critic implementations diverge. While each proposes a new technique, they also make a lot of weird and somewhat arbitrary modifications, and none quite follows the paper it serves as the official implementation for.

My intention is to first create 1-1 matches for each, and then create versions that actually match the details in the papers. I have the first two 1-1 matches done in my own branch but need to check that I can match performance (thankfully those are standard Atari). For the continuous control ones, I can probably stick modified versions of the environments in the files themselves.
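Concretely, by "in the files themselves" I mean keeping the usual CleanRL layout and just defining the environment class at the top of the script, then building it inside the make_env thunk, roughly like this (names are placeholders, and the wrapper is the kind of thing @kinalmehta sketched above):

```python
# Rough shape only: the custom env class (e.g. a HalfCheetahDirWrapper like the
# one sketched earlier in this thread) lives in the same single-file script, so
# no extra package needs to be installed.
import gym


def make_env(env_id: str, seed: int, direction: float):
    def thunk():
        env = gym.make(env_id)                       # e.g. "HalfCheetah-v3"
        env = HalfCheetahDirWrapper(env, direction)  # class defined earlier in this same file
        env = gym.wrappers.RecordEpisodeStatistics(env)
        env.seed(seed)
        return env

    return thunk
```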

Should I put a draft PR up with my current work just to show the general approach as I work on them?

DavidSlayback avatar Sep 27 '22 21:09 DavidSlayback

That makes sense. I'd suggest putting up a draft PR for better visibility, but only if you feel more comfortable that way.

vwxyzjn avatar Sep 27 '22 21:09 vwxyzjn