mppi_pendulum
Sample from approximate dynamics
Are there any plans to extend this to approximate dynamics (e.g. with a NN), using importance sampling instead of sampling trajectories directly from the environment?
(replace the `__init__` env arg with a dynamics arg, then take env only in the control method, for the actual stepping)
That would actually match the contributions from the 2017 paper and make it more broadly applicable. I would like to use this in an environment where I can't reset the state of the simulator, so trajectories have to be generated with the model.
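Concretely, I'm imagining an interface like the sketch below (all names are placeholders, and the `...` bodies are elided; this is just to show the shape of the change):

```python
class MPPI:
    # sketch of the proposed interface; names here are placeholders
    def __init__(self, dynamics, running_cost, noise_sigma,
                 num_samples=100, horizon=15):
        # dynamics(state, action) -> next_state (e.g. a learned NN),
        # replacing the env arg so rollouts never touch the simulator
        self.F = dynamics
        self.running_cost = running_cost
        ...

    def control(self, env):
        # env appears only here: read the true state, run the MPPI
        # update through self.F rollouts, and step the real env once
        state = env.unwrapped.state
        ...
```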
Currently, I am not planning to implement this extension anytime soon. Please feel free to do so; a pull request would be highly appreciated!
OK, I'll look into it. Let me list what I think the changes need to be. I plan to make these in a new file, to preserve the current file as an example of what to do when you can actually sample from the environment. I might also implement a generic MPPI package afterwards.
easy stuff
- [x] take in approximate dynamics
- [x] take in running state cost and terminal state cost functions
- [x] take in a trainer method (it'll concatenate the new data and retrain the approximate dynamics; see the sketch after this list)
- [x] move env to a parameter of the control method
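For the trainer item, here's a rough sketch of what I mean (hypothetical names; assumes the network is trained to predict the state delta):

```python
import torch

def make_trainer(model, optimizer, epochs=50):
    # hypothetical trainer hook: keep all observed transitions and
    # refit the approximate dynamics network on the combined dataset
    xs_all, ys_all = [], []

    def train(states, actions, next_states):
        xs_all.append(torch.cat((states, actions), dim=1))
        ys_all.append(next_states - states)  # fit the state delta
        xs, ys = torch.cat(xs_all), torch.cat(ys_all)
        for _ in range(epochs):
            optimizer.zero_grad()
            loss = torch.mean((model(xs) - ys) ** 2)
            loss.backward()
            optimizer.step()

    return train
```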
stuff I'm not sure about
- [x] resample noise after each control iteration (i.e. each time we get a control command to execute); keeping the same noise might be problematic? Probably best to make it an option
- [x] cache the action cost coefficient `lambda * Sigma^{-1} * noise` (see the sketch after this list)
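To make those two items concrete, a minimal sketch (numpy, with placeholder values for the pendulum):

```python
import numpy as np

nu = 1                               # pendulum has a single torque input
lambda_ = 1.0
sigma = 0.5 * np.eye(nu)             # control noise covariance
# cached coefficient for the action cost: lambda * Sigma^{-1}
lambda_sigma_inv = lambda_ * np.linalg.inv(sigma)

def action_cost(u, eps):
    # per-step term lambda * u^T Sigma^{-1} eps in the trajectory cost
    return u @ lambda_sigma_inv @ eps

def sample_noise(K, T, rng=np.random):
    # drawn fresh each control iteration; whether reusing the same
    # draws is acceptable is the open question above, hence an option
    return rng.multivariate_normal(np.zeros(nu), sigma, size=(K, T))
```

Since `sigma` is fixed across iterations, `lambda_sigma_inv` only needs to be computed once.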
Something problematic is that the paper assumes a state-dependent running cost q(x), whereas gym's pendulum task has cost `theta^2 + 0.1*theta_dt^2 + 0.001*action^2` (https://github.com/openai/gym/wiki/Pendulum-v0), which depends on the action too.
I think the formulation doesn't change much if we allow q(x,u) to replace q(x).
Actually, since you're using the reward from the env, which is q(x,u) anyway, this doesn't change anything.
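For reference, this is the q(x, u) I mean, matching the cost quoted above (with the angle wrapped, since the env doesn't bound theta):

```python
import numpy as np

def angle_normalize(x):
    # wrap the angle into [-pi, pi); gym's pendulum uses this same helper
    return ((x + np.pi) % (2 * np.pi)) - np.pi

def running_cost(state, action):
    # q(x, u) for Pendulum-v0, matching the cost quoted above
    theta, theta_dt = state
    return angle_normalize(theta) ** 2 + 0.1 * theta_dt ** 2 + 0.001 * action ** 2
```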
EDIT3: It's able to work somewhat reliably now after gathering some data (around ~500 steps). I had to normalize the angle before feeding it into the network, because the env doesn't normalize it, so I ran into the issue of being in a totally unsupported domain when the pendulum wrapped around. Even though the network's MSE can get down to around 0.0001, it's not able to stabilize around the top indefinitely; maybe you can take a look at why?
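The normalization fix amounts to something like this (a sketch; `model` is the dynamics network, assumed to predict the state delta):

```python
import math
import torch

def wrapped_dynamics(model, state, action):
    # hypothetical wrapper: wrap theta into [-pi, pi) before the network
    # sees it, since the env's internal angle grows without bound once
    # the pendulum wraps around
    theta = ((state[..., :1] + math.pi) % (2 * math.pi)) - math.pi
    x = torch.cat((theta, state[..., 1:], action), dim=-1)
    return state + model(x)  # network predicts the state delta
```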

EDIT2: It sometimes learns to swing up, but doesn't know how to stay up; maybe you can take a look to see if this is expected behaviour: https://github.com/LemonPi/mppi_pendulum

EDIT: Fixed the problem of getting stuck on one side (it was due to an unnormalized theta in the cost function); now it's just struggling to get past the halfway point of the swing-up.
OK, what's really strange is that with the same model structure and random seed, my pytorch implementation of the generic MPPI (generalized to other problem dimensions) achieves stability on the pendulum problem. It's also 70 times faster on 100 samples, even on the CPU, thanks to batch sampling.
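The batch sampling is essentially the following (a simplified sketch of the pytorch version; `model` and `running_cost` are assumed to operate on batches, and `U` is the nominal control sequence):

```python
import torch

def rollout_costs(model, state, U, eps, running_cost):
    # batched rollout: all K perturbed trajectories are propagated at
    # once as a (K, nx) tensor instead of looping per sample -- this is
    # where the speedup over the per-trajectory numpy loop comes from
    K, T, nu = eps.shape
    states = state.repeat(K, 1)          # (K, nx)
    cost = torch.zeros(K)
    for t in range(T):
        u = U[t] + eps[:, t]             # perturb nominal control, all K at once
        states = states + model(torch.cat((states, u), dim=1))
        cost = cost + running_cost(states, u)
    return cost
```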

There's either a bug in the numpy implementation, or some weird interaction with pytorch in how I set it up...