reinforcement-learning DQN solution results peak at ~35 reward

Hi Denny,

Thanks for this wonderful resource. It's been hugely helpful. Can you say what your results are when training the DQN solution? I've been unable to reproduce the results of the DeepMind paper. I'm using your DQN solution below, although I did try mine first :)

While training, the TensorBoard "episode_reward" graph peaks at an average reward of ~35, and then tapers off. When I run the final checkpoint with a fully greedy policy (no epsilon), I get similar rewards.

The DeepMind paper cites rewards around 400.

I have tried 1) implementing the error clipping from the paper, 2) changing the RMSProp arguments to (I think) reflect those from the paper, 3) changing the "replay_memory_size" and "epsilon_decay_steps" to the paper's 1,000,000, 4) enabling all six agent actions from the Open AI Gym. Still average reward peaks at 35.

Any pointers would be greatly appreciated.

Screenshot is of my still-running training session at episode 4878 with all above modifications to dqn.py:

Nov 15 '16 17:11 nerdoid

I've been seeing the same and I haven't quite figured out what's wrong yet (debugging turnaround time is quite long). I recently implemented A3C and also saw results peak earlier than they should.

The solutions share some of the code so right now my best guess is that it could be something related to the image preprocessing code or the way consecutive states are stacked before being fed to the CNN. Perhaps I'm doing something wrong there and throwing away information.

If you have time to look into that of course it'd be super appreciated.

Nov 15 '16 18:11 dennybritz

I saw the same on a3c after 1.9 million steps:

A3C on Breakout

And attributed it to the fact that I only used 4 learners (2 are hyper threads, only 2 real cores). In the paper, the training is orders of magnitude slower due to less stochasticity from fewer agents. The variance remained huge too, and it plateaus maybe too quickly in about 38 of score.

I started reviewing DQN first (though I got a high level view of how A3c is implemented too and will come back to it later), and made some dumb questions so far, as I was seeing the same level of performance.

One thing that would help diagnose performance is building the graph so we can inspect it in TensorBoard. Other than that, it could be many things. The most impactful is exactly how the backdrop is behaving, and (for me) if the Gather operation and the way the optimization is requested result in exactly what is wanted. Probably, it is, but that's where I am looking now.

Nov 15 '16 18:11 fferreres

Thanks for helping out.

The most impactful is exactly how the backdrop is behaving, and (for me) if the Gather operation and the way the optimization is requested result in exactly what is wanted. Probably, it is, but that's where I am looking now.

If the backprop calculation was wrong I think the model wouldn't learn at all. It seems unlikely that it starts learning but bottoms out with an incorrect gradient. Unlikely, but not impossible. It's definitely worth looking into, but it'd be harder to debug than looking at the preprocessing. A simple thing to try would be to multiply with a mask matrix (a one hot vector) instead of using tf.gather.

Adding the graph to Tensorboard should be simple. It can be passed to the saver that is created in the code, e.g. tf.train.SummaryWriter(..., sess.graph).

Nov 15 '16 18:11 dennybritz

Thanks for the responses.

So I manually performed some environment steps and displayed their preprocessed forms. An example is below. It looks OK to me. The paper also mentions taking the max of the current and previous frames for each pixel location in order to handle flickering in some games. It seems to be unnecessary here since I didn't observe the ball disappearing between one step and the next.

I'm happy to check out the stacking code next. Looking at TensorFlow's Convolution documentation, is it possible that information is being lost by passing the state histories to conv2d as channels? Depending on how conv2d combines channel filters, maybe history is getting reduced to one state in the first convolutional layer. Maybe depthwise_conv2d is relevant or, alternatively, flattening the stack prior to feeding in to the network. I'm more comfortable with RNNs than with CNNs, so I could be off base here though :)

Nov 15 '16 22:11 nerdoid

Depending on how conv2d combines channel filters, maybe history is getting reduced to one state in the first convolutional layer. Maybe depthwise_conv2d is relevant or, alternatively, flattening the stack prior to feeding in to the network. I'm more comfortable with RNNs than with CNNs, so I could be off base here though :)

I think you are right in that the history is being reduced to one representation after the first convolutional layer, but I think that's how it's supposed to be - You want to combine all the historic frames to create a new representation. If you used depthwise_conv2d you would never combine information across frames.

I do agree that it seems likely that something is wrong in the CNN though... It could be a missing bias term (there is no bias initializer also in the fully connected layer), wrong initializer, etc..

Nov 15 '16 22:11 dennybritz

I just looked at the paper again:

The final hidden layer is fully-connected and consists of 512 rectifier units. The output layer is a fully-connected linear layer with a single output for each valid action.

In my code the final hidden layer does not have a ReLU activation (or bias term). Maybe that's it?

Nov 15 '16 22:11 dennybritz

Nice catch. That does seem like it would limit the network's expressiveness. Unless you already have something running, I'll go ahead and fire up an instance and give it a shot with ReLU on the hidden fully connected layer. Can also add biases to both fully connected layers (probably initialized to zero?).

Nov 15 '16 23:11 nerdoid

I'm not running it right now so it'd be great if you could try. In my experience the biases never helped much but adding them won't hurt. Zero-initialized should be fine.

Nov 15 '16 23:11 dennybritz

I don't know how to add the graph the right way to review with TensorBoard, since at tf.train.SummaryWriter(..., sess.graph)time the session isn't created yet (but much later). If I try to change things I'll refactor the wrong way.
Some other things that can affect performance and that are discussed in several forums as things that have been done by the DeepMind team:

Clipping of rewards to range (1,-1) - I think breakout not affected
Clipping of the gradients too, to limit any one update the hard way if necessary (not sure how they do this or how real, they'd need to mess with compute gradients output.
They consider terminal state after LOSS OF LIFE not after episode ends. This is reported to have a large impact in the learning rate and in the score you can achieve (likely, because without it, it's hard to notice for the agent that losing a life is really very bad and not "free")

I really suspect the backprop is causing most of the trouble, and there are two potential problems with it.

The first (a) is regarding the Gather operation instead of the one_hot (that you mentioned, Danny), the second (b) is in asking the optimizer to think of vars in relation to multiple different examples (instead of vars we want fixed in general across an entire mini batch, which is usually how it will "think", that the same rules apply to the entire mini-batch).

a) The Gather operation operates very differently than a one hot. The Gather tells the optimizer to disregard completely the impact on the actions not taken as if they didn't affect the total cost. And it results in updates that disregard that we have placed a lot of effort on fine tuning the other variables because on the next state they may be the right action, and we want Q to provide a good estimate of those state, action values. But with Gather (instead of one hot) the optimizer reduces the cost in one action, without balancing the need to not change the value estimates for the other actions. So, regardless of how Gather works, a one hot is needed so that when the optimizer starts looking at the derivatives, it takes into account the other actions cost, and thus tries to find the direction which improves the current action cost (lowering it) while affecting the other action values (ie. current output value) as little as possible.

b) The other is regarding the mini-batch. I am not sure how TensorFlow mini-batch works, but it may be possible that it's not designed to have Variables corresponding to different examples within a single Batch. Another way to say it is that I think mini-batch assumes you don't change the variables to be fixed between examples within a batch, and one way to look at it is to notice how the optimizer always works across Trainable Vars, and the List of Variables that minimize takes as argument, most likely expect to be vars used across the mini match for each example. eg. you may be optimizing only x1 and not x2. But may not work if you want to optimize x1 in the first example of the batch and then x2 in the second example of the batch. This I only suspect must be so, as in many other context mini-batch is derived from a cost function that is static (not fixing vars across the batch).

I hope I am not saying something too stupid. If so apologies. Bottom line is that reward and gradient clipping (I just noticed tensorflow makes that easy to implement), ending episode when life is lost, using a one hot instead of gather, and ReLU at the final fully connected layer, may combined boost learning significantly, here, and in what may apply from here to a3c (then I care the most about).

Nov 16 '16 05:11 fferreres

Depending on how conv2d combines channel filters, maybe history is getting reduced to one state in the first convolutional layer. Maybe depthwise_conv2d is relevant or, alternatively, flattening the stack prior to feeding in to the network. I'm more comfortable with RNNs than with CNNs, so I could be off base here though :)

I just edited my response which was wrong. I really don't know how conv usually combines the channels. Simplest case is averaging, in cs231 hey say sum as in many other resources. The history isn't necessarily lost, and will be spread along the filters.

Nov 16 '16 06:11 fferreres

Unfortunately, it seems to have plateaued again. Currently at episode 3866. Here's the code I used:

fc1 = tf.contrib.layers.fully_connected(flattened, 512, activation_fn=tf.nn.relu, biases_initializer=tf.constant_initializer(0.0))
self.predictions = tf.contrib.layers.fully_connected(fc1, len(VALID_ACTIONS), biases_initializer=tf.constant_initializer(0.0))

Nov 16 '16 18:11 nerdoid

Is something like this possible? If the action is not taken, set the target value be equal to the current Q prediction for that action. For the (one) Action taken, set the target y value be the one of our target value q(s',a'). Or does it need a one hot? If possible, then

self.y_pl is filled with the TD target we calculated (using s',a') when the index corresponds to the action taken, and all other y targets (actions not chosen) are filled with just the predicted value from self.predictions (same value in Q vs target y). Then just:

# Calcualte the loss
        self.losses = tf.squared_difference(self.y_pl, self.predictions)
        self.loss = tf.reduce_mean(self.losses)

Nov 16 '16 21:11 fferreres

Thanks for the comments. Here are a few thoughts:

Clipping of rewards to range (1,-1) - I think breakout not affected

Good point. Likely to make a difference for some of the games.

Clipping of the gradients too, to limit any one update the hard way if necessary (not sure how they do this or how real, they'd need to mess with compute gradients output.

This won't make a difference in learning ability, it'll only help you avoid exploding gradients. So as long as you don't get NaN values for your gradients this is not necessary. It's easy to add though.

They consider terminal state after LOSS OF LIFE not after episode ends. This is reported to have a large impact in the learning rate and in the score you can achieve (likely, because without it, it's hard to notice for the agent that losing a life is really very bad and not "free")

Really good point and may be the key here. This could make a huge difference. I looked at the paper again and I actually can't find details on this. Can you point me to it?

But with Gather (instead of one hot) the optimizer reduces the cost in one action, without balancing the need to not change the value estimates for the other actions. So, regardless of how Gather works, a one hot is needed so that when the optimizer starts looking at the derivatives, it takes into account the other actions cost, and thus tries to find the direction which improves the current action cost (lowering it) while affecting the other action values (ie. current output value) as little as possible.

I don't think this is right. It is correct to only take into the account the current action. Remember that the goal of the value function approximator is the predict the total value of a state action pair, q(s,a). The value of an s,a pair is completely independent of other actions in the same state. In fact, the only reason our estimator even outputs values for all actions is because it's faster. In theory you could have a separate estimator for each actions (as it is done in regular Q-Learning). You may be thinking of policy-based methods like policy gradients that predict probabilities for all actions. There we need to take into account the actions that are not taken, but not here. I'm pretty confident the current code is correct.

The other is regarding the mini-batch. I am not sure how TensorFlow mini-batch works, but it may be possible that it's not designed to have Variables corresponding to different examples within a single Batch

I'm not sure if I understand this point. Can you explain more? However, I'm pretty confident that the optimization is right.

So I think it may be worth looking at the reward structure of the environment and what they do in the original paper...

I'll try to refactor the code on the weekend to look at the gradients in Tensorboard. That should give some insight into what's going on.

Nov 17 '16 04:11 dennybritz

Denny,

Really good point and may be the key here. This could make a huge difference. I looked at the paper again and I actually can't find details on this. Can you point me to it?

In the paper, under Methods, Training, they mention this:

For games where there is a life counter, the Atari 2600 emulator also sends the number of lives left in the game, which is then used to mark the end of an episode during training.

I've seen in some forums that they take this for granted, and they tried it in at least Breakout and it made a large difference.

SKIP THIS - It's just background on how I was confused...

Regarding the optimization, thanks for taking the time to explain. I highly appreciate it, and will sit on it and try to become more familiar with the concepts. My comment was inspired by comment by Karpathy that implemented REINFORCE.js and that in that page literally mentions under Bell & Whistles, Modeling Q(s,a):

Step 3. Forward fθ(st) and apply a simple L2 regression loss on the dimension a(t) of the output, to be y. This gradient will have a very simple form where the predicted value is simply subtracted from y. The other output dimensions have zero gradient.

But I see how I miss-read it. It confuses me a bit that having 10 outputs can be the same as having 10 networks dedicated to each action (not that both don't work, but how the 10 outputs can be as expressive).

I am also probably mixing things...such as when doing classification with softmax, we set labels for all classes, the wrong ones we send closer to 0 and the one that was right towards 1. So it trains the network to not only output something closer to one for the right label, but also trains the other labels to be closer to 0. In this case (DQN) I though similarly, that our "0" for the other actions was whatever value the function estimates (which is not relevant in the current state, but we don't want to change much because those action where not taken), thus I thought their equivalent to "0" was to set them to the value the function estimates, and the "1" was the cost we actually calculate for the action taken. I know I am still pulling my hair, ... :-)

Nov 17 '16 05:11 fferreres

For games where there is a life counter, the Atari 2600 emulator also sends the number of lives left in the game, which is then used to mark the end of an episode during training.

Hm, but that doesn't mean the end of an episode happens when a life is lost. It could also mean that they use the life counter to end the episode once the count reaches 0. It's a bit ambiguous. It's definitely worth trying out.

Nov 17 '16 05:11 dennybritz

I'm having a hard time wrapping my head around the impact of ending an episode at the loss of a life considering that training uses random samples of replay memory rather than just the current episode's experience. Starting a new episode is not intrinsically punishing, is it? A negative reward for losing a life is obviously desirable though. I'd love to be wrong because this would be an easy win if it does make a difference.

Nov 17 '16 06:11 nerdoid

On another note, I manually performed a few steps (fire, left, left) while printing out the processed state stack one depth (time step) at a time, and it checks out. The stacking code looks good to me.

Nov 17 '16 06:11 nerdoid

@nerdoid I'm not 100% sure either, but here are some thoughts:

Since the episode isn't ended you could potentially stack frames across lives. That doesn't seem right since there's a visual reset when you lose a life. But I agree this isn't likely to make a huge difference given that these samples from the replay memory should be pretty rate.
Intuitively, the agent isn't punished by losing a live because it knows that it can get rewards with the next life. Imagine a scenario where you are just about to lose a life. What should be Q value be? It depends on how many lives you have left. With no lives left the value should be 0, but with multiple lives left the value should be > 0. However, the estimator doesn't have a notion of how many lives are left (it's not even in the image) so it would make sense that it becomes confused about these kind of states. It becomes an easier learning problem when the target values of states just before losing a life are set to 0.

I know it's a bit far fetched of an explanation and I don't have a clear notion of how it affects things mathematically, but it's definitely worth trying.

Nov 17 '16 06:11 dennybritz

@fferreres I think your main confusion is that you are mixing up policy based methods (like REINFORCE and actor critic) with value-based methods like Q-Learning. In policy-based methods you do need to take all actions into account because the predictions are a probability distribution. However, in the DQN code we are not even using a softmax ;)

Nov 17 '16 06:11 dennybritz

Just to clarify the above: By not ending the episode when a life is lost you are basically teaching the agent that upon losing the visual state is reset and it can accumulate more reward in the same episode starting from scratch. In a way you are even encouraging it to get into states that lead to a loss of life because it learns that these states lead to a visual reset, but not end of episode, which is good for getting more reward. So I do believe this can make a huge difference. Does that make sense?

Nov 17 '16 07:11 dennybritz

Ahhh, yeah that makes a lot of sense. Thanks. I wonder then about the relative impact of doing a full episode restart vs. just reinitializing the state stack and anything else like that. Either way it's worth trying.

Although I do still have this nagging feeling that the environment could just give a large negative reward for losing a life. Is that incorrect?

Nov 17 '16 07:11 nerdoid

Thanks. I wonder then about the relative impact of doing a full episode restart vs. just reinitializing the state stack and anything else like that.

I think the key is that the targets are set to 0 when an life is lost, i.e. this line here:

targets_batch = reward_batch + np.invert(done_batch).astype(np.float32) * discount_factor * np.amax(q_values_next, axis=1)

So when a life is lost done should be true and the env should have a full reset.

Although I do still have this nagging feeling that the environment could just give a large negative reward for losing a life. Is that incorrect?

It could. I agree this should also improve things but the learning problem would still be confusing to the estimator. The Q value (no matter if negative or positive) depends on how many lives you have left, but that information is nowhere in the feature representation.

Nov 17 '16 07:11 dennybritz

I added a wrapper: https://github.com/dennybritz/reinforcement-learning/commit/18528d0dd743949e1555002be45af2053fc2b001#diff-14c9132835285c0345296e54d07e7eb2

I'll try to run this for A3C, would be great if you could try it for Q-Learning. Crossing my fingers that this fixes it ;)

Nov 17 '16 07:11 dennybritz

Awesome! Yes, I will give this a go for the Q-Learning in just a bit here.

Nov 17 '16 15:11 nerdoid

@dennybritz, yeah, I have things mixed up. Too many things to try to understand too quickly (new to Python, Numpy, TensorFlow, vim, Machine Learning, RL, refresh statistics/prob 101 :-)

Nov 17 '16 15:11 fferreres

Not sure if either of you have encountered this yet, but when a gym monitor is active, attempting to reset the environment when the episode is technically still running results in an error with message: "you cannot call reset() unless the episode is over." Here is the offending line.

A few ideas come to mind:

Wrap the env.reset() call in a try/except. If we end up in the except, it just means we lost a life rather than finishing. Skip resetting the state stack, and instead just continue like nothing happened. Important then to move the if done: break after the state = next_state at the end of the episode loop. We still benefit from the correct done flag being added to replay memory in the previous step.
Start and stop the monitor every record_video_every episodes. So we do the bookkeeping manually, and start the monitor only when we want to record a new video. There is probably overhead with starting and stopping the monitor repeatedly. IDK though.

The first option seems less hacky to me, but I might be overlooking something.

Nov 17 '16 23:11 nerdoid

Here are a few threads discussing low scores vs paper. It can be a source of troubleshooting ideas:

https://github.com/tambetm/simple_dqn/issues/32

https://groups.google.com/forum/m/#!topic/deep-q-learning/JV384mQcylo

https://docs.google.com/spreadsheets/d/1nJIvkLOFk_zEfejS-Z-pHn5OGg-wDYJ8GI5Xn2wsKIk/htmlview

I re-read all the DQN.py (not the notebook) and assuming all np slices and advanced indexing and functions operating over axis are ok, I still can't find what could be causing the score to flatten at ~35.

I think troubleshooting may be time consuming but it is also a good learning experience

Insome of the posts some mention doing a max among consecutive frames, some that the stacking is for consecutive frames I've seen a karpathy post explaining policy gradients and he did difference of frames.

This was reported regarding loss of life:

I regarded the loss of life as the terminal state but didn't reset the game, which is different from DeepMind. Without the modification (original simple_dqn), it was 155 test score in 22 epoch in kung_fu_master. With this modification, I got 9,500 test score in 22 epoch and 19,893 test score in 29 epoch.

Nov 18 '16 02:11 fferreres

Thanks for researching this. I made a bunch of changes to the DQN code on a separate branch. It now reset the env when losing a life, it writes gradient summaries and the graph to Tensorboard, and has gradient clipping: https://github.com/dennybritz/reinforcement-learning/compare/dqn?expand=1#diff-67242013ab205792d1514be55451053c

I'm re-running it now. If this doesn't "fix" it the in my opinion most likely source of error is the frame processing. Looking at the threads:

They use a fixed frame skip in the Atari environment. OpenAI gym defaults to a random frame skip in (3, 5). It seems like you can use a DeterministicBreakout-v0 environment to get a fixed frame skip.
It seems like they max-pool the colors over 4 frames, but the gym implementation seems to just skip frames and take the last one.
They extract luminance from the RGB data, I do grayscale (I have no idea if this is actually different)

I can't imagine that each of these makes a huge difference, but who knows. Here's the full paragraph:

Working directly with raw Atari 2600 frames, which are 2103 160
pixel images with a 128-colour palette, can be demanding in terms of computation
and memory requirements.We apply a basic preprocessing step aimed at reducing
the input dimensionality and dealing with some artefacts of the Atari 2600 emulator.
First, to encode a singlef rame we take the maximum valuefor each pixel colour
value over the frame being encoded and the previous frame. This was necessary to
remove flickering that is present in games where some objects appear only in even
frames while other objects appear only in odd frames, an artifact caused by the
limited number of sprites Atari 2600 can display at once. Second, we then extract
the Y channel, also known as luminance, from the RGB frame and rescale it to
84 x 84. The function w from algorithm 1 described below applies this preprocessing
to the m most recent frames and stacks them to produce the input to the
Q-function, in which m=4, although the algorithm is robust to different values of
m (for example, 3 or 5).

Nov 18 '16 03:11 dennybritz

Something entirely plausible is that we didn't try "hard enough". Look for example at breakout in this page:

https://github.com/Jabberwockyll/deep_rl_ale/wiki

It remains close to 0 even at 4 million steps, then starts climbing but beed millions and millions to make a difference. Also, in many of the kearning charts there are long periods of diminishing scores, but they recover later. It takes a million sometimes to do that.

Our tries are about 2-3 millions. I may have to check the reward curves from the paper, but maybe there's nothing wrong, if the plateu is actually just the policy taking a "nap". Unfortunately, I am stuck with an Apple Air (slow). Maybe with $10 I could try a3c on a 32 cpu processor...not very familiar with aws though but a spot instance for a day would do it

Nov 18 '16 03:11 fferreres

I'd stick with trying DQN instead of A3C for now - there are fewer potential sources of error.

If the implementation you linked to works correctly (it seems like it does) it should be easy to figure out what the difference is:

Reward/State Preprocessing
Network Architecture
Optimizer (They also used a different implementation of RMSProp)

Looking at it again there another difference from my code. They take 4 steps in the environment before making an update. I update after each step. That's not shown in the algorithm section in the paper, but it's mentioned in the appendix. So that's another thing to add.

Good point about the training time. If it takes 10M steps that's 2.5M updates (assuming you update only after 4 steps) and that'd be like 2-3 days looking at the current training speed I have on my GPU.

Nov 18 '16 04:11 dennybritz