machine_learning_examples
Update after terminal state
I think there's a small bug in many of your scripts: you update the return for the last step using a post-terminal step. As a result, your value (and policy) functions wind up growing (unboundedly?) near the terminal state. For example, in rl2/mountaincar you have a "train" boolean, but it is never set to false for the last step.
Hmm... I only found this train flag in one file (pg_theano), and it was just a remnant of an old version (it isn't actually used). Could you elaborate on what you were referring to?
Actually, there is an issue I did find: most scripts don't treat the value of the terminal state as 0 (in which case the return at the terminal step would just be the reward), but that doesn't sound like what you're referring to.
It's been a while since I thought about this, but I believe my kludge of not updating on the last step is effectively (though not precisely) setting the value of the terminal state to zero. Setting the terminal state's value to zero will fix the fundamental issue: the value function growing to extremely large values (i.e., much larger than the maximum possible reward).
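
For anyone who finds this later, here is a minimal sketch of the fix in tabular TD(0) form. The function and variable names are illustrative only, not taken from the repo; the point is that when `done` is true, the bootstrap term is dropped so the target is just the reward, which is equivalent to defining the terminal state's value as 0:

```python
import numpy as np

def td0_update(V, s, r, s_prime, done, alpha=0.1, gamma=0.99):
    """One tabular TD(0) update that treats the terminal state's value as 0."""
    # When the episode is done, drop the bootstrap term: the target is just
    # the reward. Otherwise the post-terminal V[s_prime] leaks into V[s],
    # and values near the terminal state can grow far beyond the maximum
    # possible return.
    target = r + (0.0 if done else gamma * V[s_prime])
    V[s] += alpha * (target - V[s])

# Toy usage: 3 states, episode terminates when transitioning into state 2.
V = np.zeros(3)
td0_update(V, s=1, r=1.0, s_prime=2, done=True)   # target = 1.0 (no bootstrap)
td0_update(V, s=0, r=0.0, s_prime=1, done=False)  # target = 0.0 + gamma * V[1]
```

The same `(1 - done)` style mask applies to Q-learning and actor-critic targets: simply skipping the final update (the kludge above) avoids the blow-up but throws away the one sample that carries the true terminal reward, so masking the bootstrap is the cleaner fix.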