Reverse sign in TD-error of DQN
This is a little detail. I suggest changing the computation of the TD-error from td_error = q_t_selected - tf.stop_gradient(q_t_selected_target) to td_error = tf.stop_gradient(q_t_selected_target) - q_t_selected in https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/deepq/build_graph.py
I think this would be good because the TD-error is also returned directly as a TensorBoard metric (so some people will read it as such). And in the literature, the TD-error is generally defined as target minus prediction (i.e. true - predicted), which corresponds to td_error = tf.stop_gradient(q_t_selected_target) - q_t_selected. E.g. see https://daiwk.github.io/assets/dqn.pdf (it is also done this way in Sutton and Barto).
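To make the suggestion concrete, here is a minimal sketch of the change (TF 1.x API; q_t_selected and q_t_selected_target are stand-in placeholders here, not the full graph from build_graph.py):

```python
import tensorflow as tf  # TF 1.x API, as used by stable-baselines

# Stand-in placeholders; in build_graph.py these come from the online and target networks.
q_t_selected = tf.placeholder(tf.float32, [None], name="q_t_selected")
q_t_selected_target = tf.placeholder(tf.float32, [None], name="q_t_selected_target")

# current:  predicted - target
# td_error = q_t_selected - tf.stop_gradient(q_t_selected_target)

# proposed: target - predicted, i.e. the usual delta = (r + gamma * max_a Q'(s', a)) - Q(s, a)
td_error = tf.stop_gradient(q_t_selected_target) - q_t_selected

# The Huber loss applied to td_error is symmetric, so gradients and training are
# unchanged; only the sign of the td_error value that gets logged/returned flips.
```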
Hello, I would be in favor of that minor change (unless another maintainer objects) and would appreciate a PR that solves this issue ;)
I was actually looking at TensorBoard and getting kind of confused. Along the same lines, the loss shown in TensorBoard grows together with the episode reward. I would expect the loss to become smaller as the DQN gets better at predicting Q-values (and therefore achieving better episode reward). Am I missing something :) ?
> I would expect the loss to become smaller as the DQN gets better at predicting Q-values (and therefore achieving better episode reward). Am I missing something :) ?
Please open another issue (after checking what I mentioned). But it sounds normal; you should take a look at fitted Q iteration (the ancestor of DQN) to understand why that might be normal ;)
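To give a rough intuition (toy numpy sketch, not stable-baselines code): the regression target in DQN is r + gamma * max_a Q_target(s', a), so the scale of the TD-error tracks the scale of the Q-values, which typically grows as the agent starts collecting more reward.

```python
import numpy as np

# Toy illustration: as the Q-value estimates grow (q_scale), the bootstrapped
# targets grow with them, so the absolute TD-error (and hence the loss) can grow
# even though the predictions are not getting relatively worse.
gamma = 0.99
rng = np.random.default_rng(0)

for q_scale in [1.0, 10.0, 100.0]:           # value estimates growing over training
    q_pred = q_scale * rng.random(1000)      # Q(s, a) from the online net
    rewards = rng.random(1000)               # per-step rewards, fixed scale
    q_next = q_scale * rng.random(1000)      # max_a Q_target(s', a)
    td_target = rewards + gamma * q_next
    td_error = td_target - q_pred
    print(q_scale, np.mean(np.abs(td_error)))  # grows roughly in proportion to q_scale
```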
In response to @mcapuccini, in my case I was looking at the scaled values of the TD-errors at first and they seemed to be going down. However, when I looked at the original scale, I saw that they were just becoming more and more negative (which also means that they got worse). That's why I started looking into how the TD-error is defined.
... Maybe it would be even better to pass the absolute value of the TD-error to TensorBoard?
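Something along these lines is what I have in mind (hypothetical sketch, not the existing code; the summary name is made up):

```python
import tensorflow as tf  # TF 1.x API

# Log the mean absolute TD-error so the TensorBoard curve is easy to read
# regardless of the sign convention used internally.
td_error = tf.placeholder(tf.float32, [None], name="td_error")
mean_abs_td_error = tf.reduce_mean(tf.abs(td_error))
tf.summary.scalar("mean_abs_td_error", mean_abs_td_error)
```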