
Areas for improvement

Open milesbrundage opened this issue 8 years ago • 6 comments

I'm working with this code and have made a few changes already (I would submit a pull request, but the way I've done them is pretty hacky and I've never done pull requests before :) ). They are:

  • changing hyperparameters (discount .9 --> .99, final epsilon .1 --> .05). These are based on the Mnih et al. 2015 hyperparameters. I'm not sure of the exact performance effect, but .1 seems high for a final epsilon level.
  • adding video output
  • epsilon annealing (this was done hackily, by manually specifying epsilons for different episode intervals, but could be done more cleanly; see the sketch right after this list. Per Mnih et al., I am starting with an epsilon of 1 and annealing roughly linearly over about a thousand episodes).
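
For reference, a minimal sketch of what a cleaner, purely linear schedule could look like (the 1.0 start, 0.05 floor, and 1000-episode length just mirror the numbers above; the function name and defaults are only illustrative):

def anneal_epsilon(episode, start=1.0, end=0.05, anneal_over=1000):
    # Linearly interpolate from `start` to `end` over the first `anneal_over`
    # episodes, then hold at `end`.
    if episode >= anneal_over:
        return end
    return start + (end - start) * episode / float(anneal_over)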

Other possible areas for improvement:

  • grayscaling for efficiency (a rough sketch follows this list)
  • frame skip for efficiency
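
For the grayscaling item, here is a minimal sketch of preprocessing that could be applied before the agent sees a frame, assuming observations are the raw height x width x 3 uint8 arrays that gym's Atari environments return (the helper name is illustrative; the Rec. 601 luminance weights are one common choice):

import numpy as np

def to_grayscale(frame):
    # Collapse an RGB frame (H x W x 3, uint8) to a single luminance channel
    # using Rec. 601 weights; roughly a 3x saving in memory and compute.
    return np.dot(frame[..., :3], [0.299, 0.587, 0.114]).astype(np.uint8)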

I'd potentially be interested in submitting pull requests for some of these if I can figure out how, but I thought I'd post this first to get thoughts on the above and see if people have other ideas for key areas of improvement.

milesbrundage avatar May 04 '16 06:05 milesbrundage

Hi @milesbrundage, thanks for trying out the code.

I like your ideas for improvement. I had been trying similar stuff myself, but didn't find time to complete the experiments.

If you find success with certain hyperparameters and modifications, do send in a pull request. It would help people get started with DQNs if the default run works and learns a good policy.

Thanks again!

sherjilozair avatar May 06 '16 01:05 sherjilozair

Sounds good! Just started a new run with epsilon annealing and a lot of hyperparameter changes... will see how that goes and send a pull request if it goes well.

milesbrundage avatar May 06 '16 03:05 milesbrundage

How are your improvements going? I would like to add grayscaling and frame skip too.

ShibiHe avatar May 31 '16 21:05 ShibiHe

I unfortunately haven't had time to do frame skip or grayscaling yet, but I have been training on Breakout with the hyperparameter changes and epsilon annealing for about 7000 episodes so far - still early, but I'm hopeful it will improve a lot eventually. If you are interested, here is my example.py with the epsilon annealing and video recording code (it records every 100th episode and writes the output to /tmp/). I think the epsilon annealing should probably go on longer, but this is one way to do it and is easily modified to run longer by changing the numbers. It explores at the specified epsilon for 99 episodes and then runs fully greedily on every 100th episode, just for recording purposes.

import sys
import gym
from dqn import Agent

num_episodes = 10000

env_name = sys.argv[1] if len(sys.argv) > 1 else "Breakout-v0"
env = gym.make(env_name)
env.monitor.start('/tmp/Breakout4-v0', video_callable=lambda count: count % 100 == 0)

agent = Agent(state_size=env.observation_space.shape,
              number_of_actions=env.action_space.n,
              save_name=env_name)

for e in xrange(num_episodes):
    # Epsilon schedule: roughly linear annealing from 1.0 down to 0.05 over
    # the first ~1000 episodes, specified manually per 100-episode interval.
    # Every 100th episode runs greedily (epsilon = 0) so the recorded video
    # shows pure exploitation.
    if e % 100 == 0:
        epsilon = 0
    elif e < 100:
        epsilon = 1
    elif e < 200:
        epsilon = .9
    elif e < 300:
        epsilon = .8
    elif e < 400:
        epsilon = .7
    elif e < 500:
        epsilon = .6
    elif e < 600:
        epsilon = .5
    elif e < 700:
        epsilon = .4
    elif e < 800:
        epsilon = .3
    elif e < 900:
        epsilon = .2
    elif e < 1000:
        epsilon = .1
    else:
        epsilon = .05

    observation = env.reset()
    done = False
    agent.new_episode()
    total_cost = 0.0
    total_reward = 0.0
    frame = 0
    while not done:
        frame += 1
        #env.render()
        action, values = agent.act(observation)
        #action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        total_cost += agent.observe(reward)
        total_reward += reward
    print "total reward", total_reward
    print "mean cost", total_cost/frame

env.monitor.close()


milesbrundage avatar May 31 '16 22:05 milesbrundage

(I had an error in the above the first time I posted it, but it's fixed now - my computer has crashed a few times while running this, so I've sometimes had to change the code when restoring it, but I think the above is good now - let me know if you find any issues!)

Update: this is now a pull request... I've never done a pull request before so go easy on me if I did it wrong ;)

milesbrundage avatar May 31 '16 22:05 milesbrundage

I also just saw that the description of the Breakout environment (and the other Atari environments) suggests actions are already repeated automatically, though I'm not sure how that should relate to implementing frame skip: https://gym.openai.com/envs/Breakout-v0
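
In case it's useful, here is a rough sketch of what a manual frame-skip wrapper around the step API used above could look like (the class name and default skip of 4 are just illustrative; note that if the environment already repeats actions internally, wrapping it like this would compound the two repeats):

class FrameSkip(object):
    # Repeat each chosen action `skip` times, summing the rewards and
    # returning the last observation (old gym API: obs, reward, done, info).
    def __init__(self, env, skip=4):
        self.env = env
        self.skip = skip

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward = 0.0
        observation, done, info = None, False, {}
        for _ in range(self.skip):
            observation, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return observation, total_reward, done, info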

milesbrundage avatar Jun 01 '16 08:06 milesbrundage