
WIP: Add MPO in zoo

Open · HenriDeh opened this pull request 3 years ago · 3 comments

I'm opening this as a draft so that discussion can start early. This implements the MPO algorithm from this paper and its improved version.

PR Checklist

  • [ ] Update NEWS.md?
  • [ ] Add docstrings
  • [x] Handle the case of a discrete actor. For this, I wondered whether a DiscreteNetwork akin to GaussianNetwork would be a better approach than assuming that any actor that is not a GaussianNetwork must be discrete (see the sketch after this list).
  • [x] Add some tests
  • [x] Does this handle distributed environments?
  • [x] Handle legal action masks
  • [ ] Decide default HPs
  • [ ] Make experiments with each network
  • [ ] Remove normalizer from networks?
  • [ ] Make a dedicated doc page
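
To make the discrete-actor item concrete, here is a minimal sketch of what a DiscreteNetwork mirroring GaussianNetwork could look like. The type name, fields, and the sample_action helper are illustrative assumptions, not the API that was merged:

```julia
using Flux, Distributions

# Illustrative sketch only: a categorical counterpart to GaussianNetwork.
Base.@kwdef struct DiscreteNetwork{P,L}
    pre::P = identity  # shared feature extractor
    logits::L          # head mapping features to one logit per action
end

Flux.@functor DiscreteNetwork

# Forward pass returns unnormalized logits.
(d::DiscreteNetwork)(state) = d.logits(d.pre(state))

# Sample an action and its log-probability, as the actor update needs.
function sample_action(d::DiscreteNetwork, state)
    p = softmax(vec(d(state)))
    a = rand(Categorical(p))
    return a, log(p[a])
end
```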

HenriDeh avatar Mar 17 '22 11:03 HenriDeh

Nice work!

I may not have the time to review it in the next two or three weeks. This PR is relatively independent, so feel free to merge it when you think it's ready.

findmyway avatar Mar 18 '22 03:03 findmyway

Maybe use a more general batch-sampling mechanism.

It's on my todo list, but I have never found the time to implement it. Ideally, the sampling step is independent while the training process is reactive. This may introduce some breaking changes to the current design. I'll write down my thoughts and discuss them with you later.
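
Roughly, the decoupling could look like the following minimal sketch, where batch_channel, sample_batch, learner, and update! are placeholder names and the Channel-based design is only one possibility, not what was actually implemented:

```julia
# A producer task keeps sampling batches independently; the training
# loop simply reacts to whatever arrives.
function batch_channel(sample_batch; buffer = 32)
    Channel{Any}(buffer) do ch
        while true
            put!(ch, sample_batch())  # blocks once `buffer` batches queue up
        end
    end
end

# Reactive training loop (placeholders, runs until interrupted):
# for batch in batch_channel(() -> sample_batch_from_trajectory())
#     update!(learner, batch)
# end
```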

findmyway avatar Mar 18 '22 03:03 findmyway

Just a heads-up: I will resume committing to this PR now that Trajectories has been implemented. Several changes are needed to reach a mergeable state.

HenriDeh avatar Jun 23 '22 15:06 HenriDeh

There we go, it's finally done. This PR adds MPO; you can find details on the dedicated doc page. It supports Categorical, Gaussian, and Full Covariance Gaussian policies. Compared to the MPO algorithm described in the paper linked above, it lacks two main features:

  • It uses 1-step TD learning to update the critic network, whereas the paper uses Retrace; implementing Retrace is a WIP. (A sketch of the 1-step target follows this list.)
  • It does not support distributed learners with gradient pooling. This is for later.
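
For reference, here is a minimal sketch of the 1-step bootstrapped target used for the critic; the names, shapes, and the concatenated state-action input are illustrative assumptions:

```julia
# 1-step TD target: r + γ(1 - done) Q′(s′, a′), with a′ sampled from the
# current policy and Q′ the target critic. Retrace would instead correct
# multi-step off-policy returns with truncated importance weights.
function td_target(target_critic, r, done, s′, a′; γ = 0.99f0)
    q′ = vec(target_critic(vcat(s′, a′)))  # next-state Q-values for the batch
    r .+ γ .* (1 .- done) .* q′
end
```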

I implemented three experiments in the test suite, one for each type of policy. On my machine, each learns a perfect CartPole policy in under a minute using only a CPU.
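
If you want to try them, experiments are typically launched with the E string macro; the experiment name below is a guess at the naming pattern, so check the test suite for the actual identifiers:

```julia
using ReinforcementLearning

# Hypothetical experiment name; the real ones live in the test suite.
run(E`JuliaRL_MPO_CartPole`)
```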

HenriDeh avatar Dec 22 '22 11:12 HenriDeh