
[WIP] Synchronous PPO Agent

Open lemonteaa opened this issue 6 years ago • 8 comments

Hi everyone,

This is my first PR (and a relatively big one) implementing a PPO Agent. Due to limitations in keras-rl's current architecture (and not wanting to change it without first discussing and gaining consensus from the community), this version is synchronous rather than the full A3C one. The Generalized Advantage Estimator and the clipped surrogate objective are included, but not the adaptive KL penalty.

This implementation intends to support actor networks with dummy inputs and (by necessity, given how PPO works) actions generated according to a specified probability distribution. The dummy inputs are useful for incorporating the parameters of the probability distribution into the network so that they, too, can be tuned.

I'm now in the integration testing stage, where I will try to iteratively discover and resolve the remaining bugs/issues in the code. The current problem I'm looking at: for the original actor network, once everything is set up correctly, you can skip supplying values for the dummy inputs during evaluation/prediction (even though the model's inputs list must still include them). However, after cloning the actor this no longer works.
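To make this concrete, here's a minimal sketch of the setup (not the actual PR code; the layer names, sizes, and shapes are made up) under Keras 2.x with the TensorFlow backend:

# Illustrative only -- a toy actor with one placeholder input and one constant "dummy" input.
import numpy as np
import tensorflow as tf
from keras.layers import Input, Dense
from keras.models import Model

obs = Input(shape=(3,), name='obs')                             # regular placeholder input
dummy = Input(tensor=tf.constant([[0.0, 0.0]]), name='dummy')   # constant dummy input

mu = Dense(2, name='mu')(Dense(8, activation='tanh')(obs))
# The dummy input exists only so that the distribution parameters (here a
# per-action sigma) become trainable weights of the actor itself.
sigma = Dense(2, activation='softplus', name='sigma')(dummy)

actor = Model(inputs=[obs, dummy], outputs=[mu, sigma])

# The constant tensor is baked into the graph, so only obs needs a value here:
mu_val, sigma_val = actor.predict(np.zeros((1, 3)))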

For maintainer/merger/reviewer:

Because this branch diverged/originated quite a long time ago, I would suggest having a parallel branch on the main repo and merging into that first, especially considering all the dependency version changes (Keras itself, for example), re-orgs, and similar work that has been ongoing and accelerating recently.

Thanks!

(Edit: attached the Jupyter notebooks used during development, one for exploration/concept, another for testing) PPO_JupyterNotebooks.zip

lemonteaa avatar Mar 26 '18 14:03 lemonteaa

why not merge into master?

anpark avatar May 07 '18 14:05 anpark

@anpark We're getting back on track with the repo, but I haven't found the time yet to look into this PR. Hopefully I will do it in May!

RaphaelMeudec avatar May 07 '18 20:05 RaphaelMeudec

@anpark Well, because i) this is a relatively big change and ii) it branched off a long time ago. Anyway, I think the core dev team can merge it back themselves once they are comfortable/satisfied with the maturity of the code.

@RaphaelMeudec Thanks! Also, kudos for the hard work of keeping this project alive and running~

So I finally have time to look at the problem. It seems to me that this is a missing feature/detail in Keras. Specifically, InputLayer does not export the input_tensor parameter (which contains the constant tensor), so it is lost during cloning. The visible effects are that:

  1. The InputLayer.is_placeholder attribute is wrong, and
  2. The Model._feed_inputs and related attributes list all of the inputs rather than only the ones that actually need to be fed.
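A quick way to see both symptoms (again just a sketch; the model is an assumed minimal stand-in for the actor, and _feed_input_names is a private Keras 2.x attribute):

import tensorflow as tf
from keras.layers import Input, Dense, InputLayer
from keras.models import Model, clone_model

obs = Input(shape=(3,), name='obs')
dummy = Input(tensor=tf.constant([[0.0, 0.0]]), name='dummy')
model = Model([obs, dummy], [Dense(2)(obs), Dense(2)(dummy)])

def describe(m):
    placeholders = [(l.name, l.is_placeholder) for l in m.layers if isinstance(l, InputLayer)]
    return placeholders, m._feed_input_names

print(describe(model))               # dummy: is_placeholder is False, not in the feed list
print(describe(clone_model(model)))  # after cloning, dummy looks like a normal placeholder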

Ideally Keras should handle this. Workarounds don't seem easy in this case because:

  1. The cloning process is dynamic and automatic (the config dict for each object is passed to the class's constructor), so intervening after the fact (i.e. after the model has been cloned) would be hard/delicate.
  2. I'm not sure it is possible to override the get_config method of InputLayer: i) we are trying to export a TensorFlow/backend-of-your-choice constant tensor, and ii) somewhere there is a restriction/check that input layers must use the InputLayer class.

The only fallback I can think of now is to manually supply those dummy input values, outside the Keras library. But then I would need to read the constant tensor's value, which, at least in TensorFlow, I hear is again not easy (you can mostly only read it inside a session). And this solution is rather ugly IMO.
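For the record, the reading part can at least be done through the Keras backend. A rough sketch of that fallback (TensorFlow backend; everything here is illustrative rather than the PR code):

import numpy as np
import tensorflow as tf
from keras import backend as K

dummy_tensor = tf.constant([[0.0, 0.0]])   # stands in for the actor's constant input tensor
dummy_value = K.eval(dummy_tensor)         # K.eval runs the tensor in the backend session

# The cloned actor would then be called with the dummy value supplied explicitly,
# tiled to the batch size, e.g.:
#   cloned_actor.predict([obs_batch, np.repeat(dummy_value, batch_size, axis=0)])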

lemonteaa avatar Jun 09 '18 10:06 lemonteaa

Okay, did a not-so-quick fix using the ugly approach for now. I checked that it runs on my machine, but I haven't verified that the trained agent actually performs well enough. There are some unpolished edges too; in particular, I'm not 100% confident about the dimensions of the action input (a bit tired now), and there's the old problem of handling NaN (or other special IEEE values) gracefully in the Gaussian sampler (both the sampling and the distribution function). One point to note is that even once it's more polished, I think it will need to somehow deal with the case where sigma = 0 (as that implies a blow-up of the probability density function).

I used the following settings:

agent = PPOAgent(actor, 1, critic, memory, sampler, nb_steps=500)
agent.fit(env, nb_steps=5000, nb_max_episode_steps=400)

lemonteaa avatar Jun 10 '18 15:06 lemonteaa

Please let me know if your code supports synchronous multi-agent learning.

random-user-x avatar Jun 20 '18 05:06 random-user-x

@lemonteaa Please let me know the current status of your implementation. I am looking into your code for the time being. I have implemented ACER using keras-rl. For the buffer memory, I will be implementing general code that can be used for all the algorithms. Give me some time for this.

Let me know if you need any help from my side. I may be able to help you out.

random-user-x avatar Aug 09 '18 08:08 random-user-x

You may want to have a look at #232 and #231

random-user-x avatar Aug 09 '18 09:08 random-user-x

@mirraaj Hi, I'm busy with other projects currently. The only remaining issues as far as I know are:

  • Need to integrate against a more realistic test case (I tested using a toy example), say the 3D robotic body problem presented in the papers.
  • The Gaussian sampler appears to sometimes generate NaN in my tests. The expression used there right now isn't very elegant either: I used a log transform to handle the range issue for the standard deviation parameter (which by itself should obviously be >= 0, with 0 being an edge case). See the rough sketch after this list.
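Roughly what I mean (numpy only; the names are made up, and this is an illustration of the log-sigma parameterisation and the sigma = 0 edge case, not the PR's sampler):

import numpy as np

EPS = 1e-6  # floor on sigma to keep the density and its log finite

def sample_action(mu, log_sigma, rng=np.random):
    sigma = np.maximum(np.exp(log_sigma), EPS)
    return mu + sigma * rng.standard_normal(mu.shape)

def gaussian_log_pdf(action, mu, log_sigma):
    sigma = np.maximum(np.exp(log_sigma), EPS)
    z = (action - mu) / sigma
    return -0.5 * z ** 2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)

# Without the EPS floor, log_sigma -> -inf (i.e. sigma = 0) makes the density
# blow up, which is one way NaNs can creep in during training.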

As for the buffer, it is still the old version from when I was doing development, and PPO itself has specific needs (for the Generalised Advantage Estimation you need to do a pass over the data), so I remember customising it (not sure though). But that component should work; it may be slow, but it works.
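For reference, the pass I'm referring to is the standard backward GAE recursion over an episode's transitions (a sketch of the textbook formulation, not the PR's exact code):

import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    # values: V(s_t) for each step of the episode; last_value: bootstrap V(s_T).
    values = np.append(values, last_value)
    advantages = np.zeros(len(rewards))
    acc = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        acc = delta + gamma * lam * acc
        advantages[t] = acc
    return advantages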

Let me know if you have any other questions, thanks.

lemonteaa avatar Aug 09 '18 09:08 lemonteaa