machine_learning_examples
Update after terminal state
I think there's a small bug in many of your scripts: you update the return for the last step using a post-terminal step. As a result, your value (and policy) functions wind up growing (unboundedly?) near the terminal state. For example, in rl2/mountaincar you have a "train" boolean, but it is never set to false for the last step.
Hmm... I only found this train flag in one file (pg_theano), and it was just a remnant of an old version (it isn't actually used). Could you elaborate on what you were referring to?
Actually, there is an issue I did find: most scripts don't treat the value of the terminal state as 0 (in which case the return at the terminal step would just be the reward), but that doesn't sound like what you're referring to.
It's been a while since I thought about this, but I believe my kludge of not updating on the last step is effectively (though not precisely) setting the value of the terminal state to zero. Setting the terminal state's value to zero will fix the fundamental issue: the value function growing to extremely large values (i.e., much larger than the maximum possible reward).
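
For anyone who finds this later, here is a minimal sketch of the fix in tabular TD(0) form. The function and variable names are illustrative only, not taken from the repo; the point is that when `done` is true, the bootstrap term is dropped so the target is just the reward, which is equivalent to defining the terminal state's value as 0:

```python
import numpy as np

def td0_update(V, s, r, s_prime, done, alpha=0.1, gamma=0.99):
    """One tabular TD(0) update that treats the terminal state's value as 0."""
    # When the episode is done, drop the bootstrap term: the target is just
    # the reward. Otherwise the post-terminal V[s_prime] leaks into V[s],
    # and values near the terminal state can grow far beyond the maximum
    # possible return.
    target = r + (0.0 if done else gamma * V[s_prime])
    V[s] += alpha * (target - V[s])

# Toy usage: 3 states, episode terminates when transitioning into state 2.
V = np.zeros(3)
td0_update(V, s=1, r=1.0, s_prime=2, done=True)   # target = 1.0 (no bootstrap)
td0_update(V, s=0, r=0.0, s_prime=1, done=False)  # target = 0.0 + gamma * V[1]
```

The same `(1 - done)` style mask applies to Q-learning and actor-critic targets: simply skipping the final update (the kludge above) avoids the blow-up but throws away the one sample that carries the true terminal reward, so masking the bootstrap is the cleaner fix.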