Alternative cdqn P parameterization
Testing an alternative parameterization of P. It is just x*x^T + epsilon*eye, i.e. the epsilon*eye term is added after the dot product instead of before it. Each diagonal entry is then a squared value plus epsilon, so it is always > 0.
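For reference, here is a minimal sketch of the idea in plain Python. It assumes x is a vector (in the real model it would come from a network output). The helper names and the epsilon value are made up for illustration and are not taken from the actual code.

```python
def outer_plus_eps(x, epsilon=1e-3):
    """Return P = x * x^T + epsilon * I as a list of rows.

    Hypothetical helper for illustration; epsilon=1e-3 is an assumed value.
    """
    n = len(x)
    return [
        [x[i] * x[j] + (epsilon if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


def quadratic_form(P, v):
    """Return v^T P v, which is > 0 for every nonzero v when P is positive definite."""
    n = len(v)
    return sum(v[i] * P[i][j] * v[j] for i in range(n) for j in range(n))


x = [0.0, -2.0, 0.5]
P = outer_plus_eps(x)

# Each diagonal entry is x_i^2 + epsilon, so it stays positive even where x_i == 0.
diagonal = [P[i][i] for i in range(len(x))]

# v is orthogonal to x, so the x * x^T term contributes nothing here and only
# the epsilon * |v|^2 term keeps the quadratic form positive.
v = [1.0, 0.0, 0.0]
q = quadratic_form(P, v)
```

In general v^T P v = (x . v)^2 + epsilon * |v|^2, so P should be positive definite for any epsilon > 0, which is the property the epsilon term is there to guarantee.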
Also added argparse to the cdqn example so it can either train a new model or test an existing one. It looks like a ton of changes, but most of it is just re-indentation.
I want to make sure to test the change thoroughly. Need to make sure it learns and runs as quickly as the current parameterization and is at least as stable. Any ideas on benchmarking the actual performance of the cdqn?
Cheers
@matthiasplappert I need to rebase the tests if we end up using this parameterization. First need to confirm whether it fixes NaN issues or not.
It seems to be learning the cdqn pendulum correctly without rescaling, if that means anything.
I'm hesitant to remove the original parameterization, since it is the one the authors of the original paper use. I'm not saying it is necessarily the best, but I'd keep CDQN as close to the original paper as possible to allow for proper benchmarking.
However, I don't have a problem with including a different parameterization if it works better. Could we just turn this into an option, e.g. by introducing an additional constructor parameter, parameterization, that can be set to different string values to select different modes?
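As a rough sketch of what such an option could look like: the class name, the mode strings "eps_before" and "eps_after", and the default epsilon are placeholders rather than existing API, and plain Python lists stand in for backend tensors. The "eps_before" branch is only an assumed reading of the current behaviour.

```python
class PBuilder:
    """Hypothetical helper that builds P from a square matrix L in one of two modes."""

    MODES = ("eps_before", "eps_after")

    def __init__(self, parameterization="eps_before", epsilon=1e-3):
        # Fail early on an unknown mode string instead of silently using a default.
        if parameterization not in self.MODES:
            raise ValueError(
                "Unknown parameterization {!r}; expected one of {}".format(
                    parameterization, self.MODES))
        self.parameterization = parameterization
        self.epsilon = epsilon

    def _add_eps_eye(self, M):
        # Add epsilon to the diagonal entries only.
        n = len(M)
        return [[M[i][j] + (self.epsilon if i == j else 0.0)
                 for j in range(n)] for i in range(n)]

    @staticmethod
    def _times_own_transpose(M):
        # (M * M^T)[i][j] is the dot product of rows i and j of M.
        n = len(M)
        return [[sum(M[i][k] * M[j][k] for k in range(n))
                 for j in range(n)] for i in range(n)]

    def build(self, L):
        if self.parameterization == "eps_before":
            # Assumed current behaviour: epsilon*eye is added to L before the dot.
            return self._times_own_transpose(self._add_eps_eye(L))
        # "eps_after": epsilon*eye is added after the dot, as proposed here.
        return self._add_eps_eye(self._times_own_transpose(L))


L = [[0.0, 0.0], [3.0, 0.0]]
P_before = PBuilder("eps_before").build(L)
P_after = PBuilder("eps_after").build(L)
```

With this L, the first diagonal entry is epsilon^2 in the "eps_before" mode and epsilon in the "eps_after" mode, so the two modes differ most when entries of L are close to zero.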
@matthiasplappert I was thinking about making it an alternative. Need to test this versus the version with the moved epsilon. Seems to be fixing the NaNs but I want to play with it some more. No rush. Trying to get a big RL project up and running so I'll just see what version works the best.