tensorflow-rl reproducing your stellar result on Montezuma's Revenge

Hi Steve, I am trying to reproduce the ~3600 score you achieved on Montezuma's Revenge with your dqn-cts model (as per the gif image on README).

After 30M steps and counting, the model does not seem to learn. It very occasionally gets the key (+100 points) and that's all. I ran your code as is and did not modify a single line.

Could I ask if you can reproduce 3600 "on average" with your dqn-cts?
Also, would you say I should try hyperparameter settings other than the ones you set as default?

I look forward to your advice.

Best wishes,

Jul 21 '17 07:07 dhfromkorea

@dhfromkorea I ran the agent several times and it usually would get close to that score. If it's not getting above 100 then there's something quite seriously wrong. May I ask precisely what command you are using to run the agent and what commit you are on?

Jul 21 '17 07:07 steveKapturowski

@steveKapturowski I am running on HEAD of master branch w/ python2.7 tensorflow(cpu) 1.2.1

The command is: python2 main.py MontezumaRevenge-v0 --load_config config/dqn-cts.yaml -n 32

Jul 21 '17 08:07 dhfromkorea

Can you try adding the following options: --q_target_update_steps=30000 --max_global_steps=160000000 --epsilon_annealing_steps=500000 --replay_size=500000 --clip_norm_type=ignore

I'm suggesting the first 4 mainly for consistency with my experiments; I suspect the norm clipping may be what's really killing performance.
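For readers following along, here is a minimal sketch of what gradient-norm clipping does, assuming `--clip_norm_type=ignore` simply turns such clipping off (the thread does not show how the repository implements it). The helper name `clip_by_global_norm` and the `max_norm` parameter are illustrative, not taken from the repository; the idea is the same as TensorFlow's `tf.clip_by_global_norm`.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients by one shared factor so their joint L2 norm is at most max_norm.

    grads is a list of flat lists of floats (one list per parameter tensor).
    Returns the (possibly rescaled) gradients and the norm measured before clipping.
    """
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if global_norm <= max_norm or global_norm == 0.0:
        # Already within the limit: return an unchanged copy.
        return [list(grad) for grad in grads], global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in grad] for grad in grads], global_norm


# Toy gradients whose joint L2 norm is sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm = clip_by_global_norm(grads, max_norm=6.5)
print(norm, clipped)
```

If the clipping threshold is too small relative to typical gradient norms, every update is shrunk by the same factor, which could plausibly slow learning in the way described here.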

Jul 21 '17 15:07 steveKapturowski

Hello Steve. I'm SangJin. I'm with dhfromkorea. It still doesn't look reproducible.

cmd: python2 main.py MontezumaRevenge-v0 --load_config config/dqn-cts.yaml -n 12
--q_target_update_steps=30000
--max_global_steps=160000000
--epsilon_annealing_steps=500000
--replay_size=500000
--clip_norm_type=ignore --restore_checkpoint

git: master / bcc9b2a tensorflow-gpu==1.2.1

[2017-07-29 12:54:36,278] T2 / STEP 70243145 / REWARD 0.0 / Q_MAX 1.5947 / EPS 0.1000
[2017-07-29 12:54:36] INFO [MainThread:284] ID: 2 -- RUNNING AVG: 9 ± 90 -- BEST: 400
[2017-07-29 12:54:36,278] ID: 2 -- RUNNING AVG: 9 ± 90 -- BEST: 400
[2017-07-29 12:54:44] INFO [MainThread:279] T4 / STEP 70246725 / REWARD 0.0 / Q_MAX 1.8437 / EPS 0.0100
[2017-07-29 12:54:44,166] T4 / STEP 70246725 / REWARD 0.0 / Q_MAX 1.8437 / EPS 0.0100
[2017-07-29 12:54:44] INFO [MainThread:284] ID: 4 -- RUNNING AVG: 14 ± 98 -- BEST: 400
[2017-07-29 12:54:44,167] ID: 4 -- RUNNING AVG: 14 ± 98 -- BEST: 400
[2017-07-29 12:55:03] INFO [MainThread:279] T3 / STEP 70256320 / REWARD 0.0 / Q_MAX 1.5227 / EPS 0.2000
[2017-07-29 12:55:03,665] T3 / STEP 70256320 / REWARD 0.0 / Q_MAX 1.5227 / EPS 0.2000
[2017-07-29 12:55:03] INFO [MainThread:284] ID: 3 -- RUNNING AVG: 28 ± 179 -- BEST: 400
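To check whether any actor's running average ever moves past the +100 key reward, the summary lines can be pulled out of a log like the one above. This is only a sketch in Python 3: the line format is inferred from this excerpt alone, and `latest_running_avg` is a hypothetical helper, not part of tensorflow-rl.

```python
import re

# Three summary lines copied from the log excerpt in this thread.
LOG = (
    "[2017-07-29 12:54:36,278] ID: 2 -- RUNNING AVG: 9 ± 90 -- BEST: 400\n"
    "[2017-07-29 12:54:44,167] ID: 4 -- RUNNING AVG: 14 ± 98 -- BEST: 400\n"
    "[2017-07-29 12:55:03] INFO [MainThread:284] ID: 3 -- RUNNING AVG: 28 ± 179 -- BEST: 400\n"
)

# Assumed format: "ID: <actor> -- RUNNING AVG: <mean> ± <std> -- BEST: <best>".
SUMMARY = re.compile(
    r"ID: (\d+) -- RUNNING AVG: (-?\d+(?:\.\d+)?) ± (\d+(?:\.\d+)?) -- BEST: (-?\d+(?:\.\d+)?)"
)


def latest_running_avg(log_text):
    """Map each actor ID to its most recent (mean, std, best) triple."""
    latest = {}
    for actor, mean, std, best in SUMMARY.findall(log_text):
        latest[int(actor)] = (float(mean), float(std), float(best))
    return latest


stats = latest_running_avg(LOG)
for actor in sorted(stats):
    print(actor, stats[actor])
```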

If you are interested, we could give you access to the server running the agent; maybe we could find out what's wrong together.

Best Regards,

Jul 29 '17 13:07 sangjin-park

Hi @sangjin-park, I'd be happy to try to debug what's going on in the server but first could you try running on the commit 39e695696488df83bf6d08a1eb7df0ff4ebd109c and tell me if there's any difference?

Aug 06 '17 23:08 steveKapturowski

Hi, I tried 452d57 and it looks OK.

Thanks!

Aug 13 '17 18:08 sangjin-park

I'm going to check the diff between commit 452d57 and master to see what went wrong and get a fix out asap

Aug 25 '17 21:08 steveKapturowski

@sangjin-park I was checking out commit 452d5735551c672e2ce44740514b105cb045075e and noticed something funny: the ordering of the context window is backwards, which I would expect to hurt performance:
https://github.com/steveKapturowski/tensorflow-rl/blob/452d5735551c672e2ce44740514b105cb045075e/utils/fast_cts.pyx#L305-L308
Compare the ordering in commit 39e695696488df83bf6d08a1eb7df0ff4ebd109c:
https://github.com/steveKapturowski/tensorflow-rl/blob/39e695696488df83bf6d08a1eb7df0ff4ebd109c/utils/fast_cts.pyx#L305-L308

Did you produce your OpenAI gym evaluation from the former commit?

Aug 26 '17 22:08 steveKapturowski

My branch's window order is the former one.

context[0] = obs[i, j-1] if j > 0 else 0
context[1] = obs[i-1, j] if i > 0 else 0
context[2] = obs[i-1, j-1] if i > 0 and j > 0 else 0
context[3] = obs[i-1, j+1] if i > 0 and j < self.width-1 else 0
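To make the ordering question concrete, the following pure-Python sketch mirrors the snippet above: for pixel (i, j) it collects the left, up, up-left and up-right neighbours, using 0 outside the frame. It is not the repository's Cython code; `context_at` is a hypothetical name, and the reversed list is only an illustration of "a different ordering", since the thread does not show what the other commit contains. Assuming the density model treats each slot of the context differently, as context-tree models generally do, the same four neighbours in a different order give the model a different input.

```python
def context_at(obs, i, j):
    """Return [left, up, up-left, up-right] for pixel (i, j); 0 where the neighbour is off-frame."""
    width = len(obs[0])
    return [
        obs[i][j - 1] if j > 0 else 0,                        # left
        obs[i - 1][j] if i > 0 else 0,                        # up
        obs[i - 1][j - 1] if i > 0 and j > 0 else 0,          # up-left
        obs[i - 1][j + 1] if i > 0 and j < width - 1 else 0,  # up-right
    ]


# A 3x3 toy frame with distinct values so each neighbour is identifiable.
obs = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

ctx = context_at(obs, 1, 1)      # neighbours of the centre pixel
reversed_ctx = ctx[::-1]         # same neighbours, opposite slot order
corner_ctx = context_at(obs, 0, 0)
print(ctx, reversed_ctx, corner_ctx)
```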

Sep 04 '17 02:09 sangjin-park
