Reverse sign in TD-error of DQN
This is a little detail. I suggest changing the computation of the TD-error from td_error = q_t_selected - tf.stop_gradient(q_t_selected_target) to td_error = tf.stop_gradient(q_t_selected_target) - q_t_selected in https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/deepq/build_graph.py
I think this would be good because the TD-error is also returned directly as a TensorBoard metric (so some people will read it as such). And in the literature, the TD-error is generally defined as target minus prediction (i.e. true - predicted), which corresponds to td_error = tf.stop_gradient(q_t_selected_target) - q_t_selected. E.g. see https://daiwk.github.io/assets/dqn.pdf (it is also done this way in Sutton and Barto).
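To make the suggestion concrete, here is a minimal sketch of the change (TF 1.x API; q_t_selected and q_t_selected_target are stand-in placeholders here, not the full graph from build_graph.py):

```python
import tensorflow as tf  # TF 1.x API, as used by stable-baselines

# Stand-in placeholders; in build_graph.py these come from the online and target networks.
q_t_selected = tf.placeholder(tf.float32, [None], name="q_t_selected")
q_t_selected_target = tf.placeholder(tf.float32, [None], name="q_t_selected_target")

# current:  predicted - target
# td_error = q_t_selected - tf.stop_gradient(q_t_selected_target)

# proposed: target - predicted, i.e. the usual delta = (r + gamma * max_a Q'(s', a)) - Q(s, a)
td_error = tf.stop_gradient(q_t_selected_target) - q_t_selected

# The Huber loss applied to td_error is symmetric, so gradients and training are
# unchanged; only the sign of the td_error value that gets logged/returned flips.
```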
Hello, I would be in favor of that minor change (unless another maintainer objects) and would appreciate a PR that solves this issue ;)
I was actually looking at TensorBoard and getting kind of confused. Along the same lines, the loss shown in TensorBoard grows together with the episode reward. I would expect the loss to become smaller as the DQN gets better at predicting Q-values (and therefore achieving better episode reward). Am I missing something :) ?
> I would expect the loss to become smaller as the DQN gets better at predicting Q-values (and therefore achieving better episode reward). Am I missing something :) ?
Please open another issue (after checking what I mentioned). But it sounds normal; you should take a look at fitted Q iteration (the ancestor of DQN) to understand why that might be normal ;)
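To give a rough intuition (toy numpy sketch, not stable-baselines code): the regression target in DQN is r + gamma * max_a Q_target(s', a), so the scale of the TD-error tracks the scale of the Q-values, which typically grows as the agent starts collecting more reward.

```python
import numpy as np

# Toy illustration: as the Q-value estimates grow (q_scale), the bootstrapped
# targets grow with them, so the absolute TD-error (and hence the loss) can grow
# even though the predictions are not getting relatively worse.
gamma = 0.99
rng = np.random.default_rng(0)

for q_scale in [1.0, 10.0, 100.0]:           # value estimates growing over training
    q_pred = q_scale * rng.random(1000)      # Q(s, a) from the online net
    rewards = rng.random(1000)               # per-step rewards, fixed scale
    q_next = q_scale * rng.random(1000)      # max_a Q_target(s', a)
    td_target = rewards + gamma * q_next
    td_error = td_target - q_pred
    print(q_scale, np.mean(np.abs(td_error)))  # grows roughly in proportion to q_scale
```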
In response to @mcapuccini, in my case I was looking at the scaled values of the TD-errors at first and they seemed to be going down. However, when I looked at the original scale, I saw that they were just becoming more and more negative (which also means that they got worse). That's why I started looking into how the TD-error is defined.
... Maybe it would be even better to pass the absolute value of the TD-error to TensorBoard?
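Something along these lines is what I have in mind (hypothetical sketch, not the existing code; the summary name is made up):

```python
import tensorflow as tf  # TF 1.x API

# Log the mean absolute TD-error so the TensorBoard curve is easy to read
# regardless of the sign convention used internally.
td_error = tf.placeholder(tf.float32, [None], name="td_error")
mean_abs_td_error = tf.reduce_mean(tf.abs(td_error))
tf.summary.scalar("mean_abs_td_error", mean_abs_td_error)
```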