deep_q_rl
Implementation of Double DQN
I was interested in implementing Double DQN in this source code, so here are my changes. Feel free to pull these into the main codebase. I didn't change much, since the Double DQN algorithm is not much different from that described in the Nature paper. I couldn't get the original tests to pass, so I was not able to add a test for Double DQN. I did test everything though, by running experiments with Breakout. Here is the performance over time:
Of course, the differences here are negligible, and Breakout was identified in the Double DQN paper as a game that does not change much under Double DQN. If I had more computing resources, I could test on the games where Double DQN makes a significant difference. Here is perhaps a more useful plot that shows how Double DQN seems to reduce value overestimates:
And here is the change required for Double DQN:
If you don't have the time to look over the changes or to test them yourself, I understand. At least this PR will allow others to use it easily if need be.
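For anyone skimming the diff: the only conceptual change from the Nature setup is the target used in the update. Standard DQN both selects and evaluates the next-state action with the target network, while Double DQN (van Hasselt et al., referenced below) selects the action with the online network and evaluates it with the target network:

```latex
% Standard DQN target, with online weights \theta and target weights \theta^{-}:
Y_t^{\mathrm{DQN}} = r_{t+1} + \gamma \max_a Q(s_{t+1}, a; \theta^{-})

% Double DQN target: action chosen by the online network,
% evaluated by the target network:
Y_t^{\mathrm{Double}} = r_{t+1} + \gamma \, Q\!\left(s_{t+1}, \arg\max_a Q(s_{t+1}, a; \theta); \theta^{-}\right)
```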
References:
van Hasselt, H., Guez, A., & Silver, D. (2015). Deep Reinforcement Learning with Double Q-learning. arXiv preprint arXiv:1509.06461.
Awesome! Been hoping someone would implement Double DQN since the paper came out. Thanks!
Thanks for the PR. I'm behind on reviewing, but I'm hoping to get caught up in late December / early January. It looks like the changes aren't very disruptive so there shouldn't be an issue merging.
Excellent. I'm starting a test run on Space Invaders since it's one where they saw a big increase. I'll let you know how it goes in a couple of days.
Plot from Space Invaders (spaceinvadersdoubleq): https://cloud.githubusercontent.com/assets/775207/11471057/9dd9c402-97b6-11e5-9b5a-ff571ea1e4a1.png
This, while not being up to scratch with DeepMind's results, is, I think, much better than any result I've seen with the deep_q_rl implementation.
It's very slow to learn but seems very stable. I might try again with a higher learning rate.
Very nice! This is with the double Q-RL? It's just switching the network every X steps, right?
I'm also impressed. My performance with deep_Q_RL never came close to their reports, either...
Best, N
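For context: the periodic "switching" is the target-network sync that the Nature-style DQN already performs every freeze_interval steps; Double DQN keeps that sync and only changes how the target value is computed. A minimal sketch of such a sync, assuming the attribute names used in the snippet later in this thread (self.l_out for the online network, self.next_l_out for the frozen copy):

```python
import lasagne


def sync_target_network(self):
    # Copy the online network's parameters into the frozen target network.
    # deep_q_rl triggers a copy like this every self.freeze_interval steps;
    # the names self.l_out / self.next_l_out are assumed from the snippet
    # quoted further down in this thread, and the function name here is
    # only illustrative.
    params = lasagne.layers.get_all_param_values(self.l_out)
    lasagne.layers.set_all_param_values(self.next_l_out, params)
```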
@alito Thanks for the examination. Do you mind sharing the results.csv for this run, and perhaps the results.csv files from any other Space Invaders models that you have trained?
Also, here is a newer paper from DeepMind that claims better performance than Double DQN: http://arxiv.org/abs/1511.06581
Could be interesting to implement.
Here is results.csv for this run (note the extra column in there): http://organicrobot.com/deepqrl/results-doubleq.csv
I don't seem to have, or at least to have kept, a recent results.csv. I've got a few from June that didn't learn at all, and a few from the NIPS era. I've put one up from May, which seems to be the best I've got, but I don't think it makes for a good comparison.
http://organicrobot.com/deepqrl/results-20150527.csv
I'm running a plain version now, but it will take a while to see what's going on.
Also, there's this: http://arxiv.org/abs/1511.05952 from last week, which, aside from reporting better results, has plots of epoch vs. reward for all 57 games. From those, it seems like even their non-double Q implementation is very stable, or at least more stable than deep_q_rl seems to be at the moment.
Thanks, Alejandro. I, for one, am curious to see how this comparison shakes out for you. When I ran deep-Q-RL the first time with Theano, it didn't really learn for me either.
The Prioritized Replay paper that you mentioned has been sitting on my desk, as it may also apply to my poker AI problems. Choosing the best replay batch set is a pain once you have a lot of so-so data... and I think others who got better learning results from deep-Q-RL talked a lot about it starting to forget parts of the game as it got better at others...
I have always suspected that they sample the game data in a more clever way than the original paper gets into. Sometimes it's just easier to say you did the simple thing. So I'm curious to see if they have now come clean :-)
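For concreteness, here is a minimal sketch of the proportional-prioritization rule from that paper (arXiv:1511.05952), assuming a plain Python list as the replay buffer; it only illustrates the sampling rule (the paper's importance-sampling weights are left out) and is not how DeepMind or deep_q_rl store transitions:

```python
import numpy as np


class ProportionalReplay(object):
    """Toy prioritized replay: P(i) proportional to p_i**alpha, where
    p_i = |TD error| + eps (Schaul et al., arXiv:1511.05952)."""

    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity = capacity
        self.alpha = alpha
        self.eps = eps
        self.transitions = []  # (state, action, reward, next_state, terminal)
        self.priorities = []   # one priority per stored transition

    def add(self, transition, td_error=1.0):
        # Drop the oldest transition once the buffer is full.
        if len(self.transitions) >= self.capacity:
            self.transitions.pop(0)
            self.priorities.pop(0)
        self.transitions.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size):
        # Sample indices in proportion to their stored priorities.
        probs = np.asarray(self.priorities)
        probs = probs / probs.sum()
        idx = np.random.choice(len(self.transitions), size=batch_size, p=probs)
        return idx, [self.transitions[i] for i in idx]

    def update_priorities(self, idx, td_errors):
        # Refresh priorities with the TD errors from the latest update.
        for i, err in zip(idx, td_errors):
            self.priorities[i] = (abs(err) + self.eps) ** self.alpha
```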
Best, Nikolai
The run without double-q hasn't finished, but it's not going to go anywhere from its current state. I've put the results up: http://organicrobot.com/deepqrl/results-20151201.csv
Here's the plot:
It does better than I expected. Looks stable if nothing else. Double-Q looks like a substantial improvement in this case.
@moscow25 they've released their code, so I suspect they are not cheating in any way they haven't mentioned. I haven't tested their code, but it wouldn't be hard to find out if it doesn't do as well as they claimed in their papers.
Awesome!
I meant that tongue in cheek. And yes, they released code, so it happened :-)
Just saying that it's always hard to specify a tech system precisely, especially in 7 pages. And this presumes that people who wrote the system remember every decision explored and taken.
Glad to see the Double Q-RL working so well. When I ran the (Lasagne) version on this after it came out, it kept starting OK but then diverged into NaN territory. Seeing it converge more steadily now is great. The idea from that paper is simple, and I'm glad it just works.
Over-optimism is a huge problem for my high-variance poker AI problems, so I'm optimistic about trying this version now. Thanks again for running the baseline.
Best, Nikolai
There seems to be a bug in your implementation: as far as I can see, you are calculating maxaction based on q_vals (which contains the Q values for s_t and NOT s_{t+1}). To fix this you have to do a second forward pass through the current Q network, using the next state. That would look like this:

```python
q_vals = lasagne.layers.get_output(self.l_out, states / input_scale)

if self.freeze_interval > 0:
    next_q_vals = lasagne.layers.get_output(self.next_l_out,
                                            next_states / input_scale)
else:
    next_q_vals = lasagne.layers.get_output(self.l_out,
                                            next_states / input_scale)

next_q_vals = theano.gradient.disconnected_grad(next_q_vals)

if self.use_double:
    # Double DQN: pick the greedy action for s_{t+1} with the *current*
    # network, but evaluate it with next_q_vals (the target network when
    # freeze_interval > 0).
    q_vals_next_current = lasagne.layers.get_output(self.l_out,
                                                    next_states / input_scale)
    maxaction = T.argmax(q_vals_next_current, axis=1, keepdims=False)
    temptargets = next_q_vals[T.arange(batch_size),
                              maxaction].reshape((-1, 1))
    target = (rewards +
              (T.ones_like(terminals) - terminals) *
              self.discount * temptargets)
```
The note by @stokasto sounds right. I'll do some testing.