
Qlearning+sarsa v2.0

Open adrianjav opened this issue 7 years ago • 28 comments

This pull request adds more reinforcement learning functionality to dlib. Namely, it implements the Q-learning and SARSA algorithms. I think I've done a good job with respect to backward compatibility and code style so far, but let me know if anything needs a review. A list of the changes:

  • The Q-learning and SARSA algorithms (files qlearning.h and sarsa.h)
  • The policies separated into their own header (policy.h), with two implementations (greedy_policy and epsilon_policy).
  • An example that shows both algorithms working (qlearning_sarsa_ex.cpp)
  • The same example as a unit test (reinforcement_learning.cpp). It works for me, but some more testing would be nice since convergence could depend on the machine.
  • An interface for the model the agents will work with (model_abstract.h).

About the mechanics: these new algorithms avoid using lspi's process_sample. Instead they run on a model (usually an online model) that lets them take steps and get feedback on their choices. What the algorithms really train aren't models but policies. That way you can specify a custom policy and custom initial weights, and, when training, the algorithms use an epsilon-greedy policy that wraps the given policy as the underlying one.

Related PR: #824

Edit: Mmmm... AppVeyor fails at the unit tests. Any suggestions on how to test whether they work? Since their convergence is non-deterministic and Q-learning can get worse after converging, I find it hard to test.

adrianjav avatar Dec 09 '17 22:12 adrianjav

Wow, this is a lot of stuff. Thanks for working on this PR :)

I haven't looked it over very carefully, but it generally looks good. You should flesh out the comments in the qlearning_sarsa_ex.cpp example a bit so that it reads like a tutorial (e.g. explains the relevant model details so that someone who isn't already familiar with Q learning can understand and learn what it does).

Also definitely fix the failing unit tests. Most of them are working though so that's good. Make sure your random number generators are all seeded the same every run so the tests reliably do the same thing. You don't want unit tests to have some small chance of failure. Beyond that I don't have much advice other than to dig in and debug it.

Let me know when you think the PR is ready for a full review.

davisking avatar Dec 10 '17 19:12 davisking

Thank god the tests failed! Reviewing the code I spotted a terrible bug when adding rewards. I've changed that and added custom PRNG support when training. Now everything seems to go smoothly as intended. I've also rewritten the comments in the example.

I think now it's ready for a full review.

adrianjav avatar Dec 12 '17 10:12 adrianjav

Hi! When will this feature be merged into master?

travek avatar Dec 28 '17 19:12 travek

Some time in the next two weeks. I've been really busy with other stuff and so haven't had time to review it yet.

davisking avatar Dec 28 '17 22:12 davisking

Thanks for your comments!

I strongly agree with you about merging feature_extractor and model into one class. I may have overthought it just because feature_extractor was already there and I tried to separate the model from its representation. But in the end it's confusing for the user.

The other things mostly come down to rushing, being used to my own code style, and English not being my first language.

I'll try to fix them all in the next few days.

adrianjav avatar Jan 07 '18 18:01 adrianjav

It's been a month since my last post and I feel like I should say something. I've been applying the changes from the review, but I'm still missing a few things.

I haven't forgotten it, I've just been really busy with my current job. As soon as I have some spare time I'll post the last changes.

adrianjav avatar Feb 05 '18 22:02 adrianjav

I've finally applied the requested changes to the old code. The most remarkable change is that, as you suggested, the abstract now describes two classes: an offline model (the former feature extractor) and an online model, so the user can choose which one to implement depending on their needs.

Overall I agree with the notes you took in the review. The only thing I could argue is that I returned matrices by value on purpose, trusting the compiler's RVO (return value optimization). Either way, I've changed that too and reviewed the comments twice.

adrianjav avatar Feb 18 '18 15:02 adrianjav

Cool, thanks for the update. I'll review the PR sometime over the coming week.

davisking avatar Feb 19 '18 13:02 davisking

I haven't forgotten about this PR. But I've got a new baby at home and have been a little busy :). I'll get to it though.

davisking avatar Mar 06 '18 12:03 davisking

Hahaha don't worry, I'm sure you'll get to it. I'm quite busy as well. Congratulations on your newborn! :baby:

adrianjav avatar Mar 06 '18 21:03 adrianjav

Thanks :)

davisking avatar Mar 07 '18 02:03 davisking

Davis, when can you push this PR ?

travek avatar Apr 15 '18 18:04 travek

Not sure. I'll get to it though. You can use it now, no need to wait on me. In fact, other people testing and reporting back here will help me in my own review when I get to it.

davisking avatar Apr 15 '18 21:04 davisking

Thanks for another good review, Davis! I've just quickly looked over the comments and I agree with what I've read so far. I'm quite busy these days but I'll try to fix everything in the next few weeks.

adrianjav avatar Apr 30 '18 13:04 adrianjav

No problem :)

davisking avatar Apr 30 '18 14:04 davisking

I think I've fixed everything so far. Some things worth mentioning (skipping documentation changes):

  • I have removed the thread-safety requirements. It doesn't seem that useful for these classes, and "keep it simple, stupid".
  • I'm keeping the mutable random_engine attribute on the classes. I don't consider it an internal state of the object, and the only reason it's not static/global is serialization.
  • Serialization works. I have added a serialization test to see whether it works as expected.
  • To keep the code clean, I have added methods to serialize.h for all the random engines from the standard library. They are serialized by just dumping their internal state into streams.

Waiting for the Travis and AppVeyor tests to finish though.

adrianjav avatar Jun 03 '18 23:06 adrianjav

Are all the comments from the previous review fixed? Just in case Someone™ wanted to pick this up to get it merged.

sandsmark avatar Feb 25 '20 17:02 sandsmark

I'm not sure. I had my second child right around the time this PR was made, and got severely sidetracked by that and a bunch of work stuff. I never ended up looking at this PR, since it's big and time consuming to review. I admit I feel bad about this, and I've left the PR open as a reminder of my shame in reviewing this. One of the issues though is I'm not sure if anyone really wants to use this. The state-of-the-art in RL is all deep learning stuff, rather than the simpler things in this diff. So I'm hesitant to merge things that are not really useful to users (that's not to say @adrianjav didn't do a nice job. He absolutely did.)

Anyway, are you @sandsmark or @adrianjav using this stuff or interested in it? If so then maybe it's worth it. In particular, is @adrianjav using this and getting value out of it?

davisking avatar Feb 26 '20 02:02 davisking

Hi @sandsmark. This PR is two years old; as far as I remember I fixed everything that was reviewed, although I'm not sure anymore.

First, thanks @davisking :) Regarding your questions, I'm not using nor interested in RL stuff. This was an attempt to make my university exercises useful to a broader audience, and I found a sweet spot in your library. At the moment I'm doing a PhD in machine learning, yet I don't focus on RL at all, so I don't get any real value out of this.

With that said, I believe that having these (simple) implementations in your library can't hurt. Worst case, they'll be used for educational purposes or as a template for implementing more complex algorithms (e.g., PPO). I'll be happy to rescue this PR if that would be useful for someone.

adrianjav avatar Feb 26 '20 09:02 adrianjav

Yeah, I guess at least it's useful as an educational resource. I'll go over this again this weekend and see about merging it :)

davisking avatar Feb 28 '20 03:02 davisking

Can you give me write access to this PR? See this for instructions: https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/allowing-changes-to-a-pull-request-branch-created-from-a-fork

I just rebased this branch on master, fixed some conflicts, and tried to push to the PR but it's not letting me.

davisking avatar Feb 29 '20 14:02 davisking

The "Allow edits from maintainers" option was already selected for me, you should have write access to my branch.

adrianjav avatar Mar 01 '20 11:03 adrianjav

Huh. I just tried again and still the same error.

remote: Resolving deltas: 100% (108/108), completed with 11 local objects.
To https://github.com/adrianjav/dlib.git
 ! [remote rejected]   adrianjav-qlearning+sarsa -> adrianjav-qlearning+sarsa (permission denied)
error: failed to push some refs to 'https://github.com/adrianjav/dlib.git'

Anyway, maybe it's because I'm trying to push a rebased change. Can you rebase on master and resolve the errors?

davisking avatar Mar 01 '20 15:03 davisking

Ok, that is weird, this is what I actually see from my side: [screenshot]

I will try to rebase onto my master branch tomorrow; I don't have the repo locally at the moment. I'll get back to you as soon as I do.

adrianjav avatar Mar 01 '20 21:03 adrianjav

I just rebased this branch onto the latest version of dlib. I locally tested the example and the unit tests and both work. Hopefully you can edit the branch now.

adrianjav avatar Mar 08 '20 11:03 adrianjav

Anyway, are you @sandsmark or @adrianjav using this stuff or interested in it?

For some definition of "use", I guess. I used to make a game a year for an AI competition I held, and I wanted to test SARSA with them. Just for fun, and because I've been thinking about starting the competition up again, and it's much easier to tune games before release with a non-trivial bot (so far I've only used my own trivial Q-learning implementations).

And the reason I asked was because I thought about fixing any remaining issues myself to get it merged, but I wasn't entirely sure what needed to be fixed.

sandsmark avatar Apr 10 '20 15:04 sandsmark

I'll go over this PR again and probably fix whatever needs fixing (if anything) and merge it. It's a big PR though so it needs a chunk of time where I can sit and work on it. I just haven't been able to get time for it the last few weeks. I'll do it soon though.

davisking avatar Apr 11 '20 02:04 davisking

Totally understandable, these last weeks haven't been easy for anyone. Stay safe!

adrianjav avatar Apr 11 '20 14:04 adrianjav