
Sample from approximate dynamics

Open LemonPi opened this issue 5 years ago • 5 comments

Are there any plans to extend this to approximate dynamics (e.g. with a NN) and to use importance sampling instead of sampling trajectories directly from the environment? (Replace the __init__ env arg with a dynamics arg, and pass env only to the control method for actual stepping.)

That would actually match the contributions from the 2017 paper and make it more broadly applicable. I would like to use this in an environment where I can't reset the state of the simulator, so trajectories have to be generated with the model.
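Roughly what I have in mind for the interface (hypothetical sketch; the class and argument names are just illustrative, not the current implementation):

```python
# Hypothetical interface sketch -- names are illustrative, not the current implementation.
import numpy as np

class MPPIApprox:
    """Plans against an approximate dynamics model; env is only used for real stepping."""
    def __init__(self, dynamics, running_cost, nu, noise_sigma=1.0, horizon=15, n_samples=100):
        self.dynamics = dynamics          # f(state, action) -> next_state, e.g. a learned NN
        self.running_cost = running_cost  # q(state, action) -> scalar
        self.nu = nu
        self.sigma = noise_sigma
        self.T = horizon
        self.K = n_samples

    def command(self, state):
        # placeholder: real rollouts would step self.dynamics, never env.step
        return np.zeros(self.nu)

# usage: the env only appears in the outer control loop
# ctrl = MPPIApprox(dynamics=model, running_cost=cost, nu=1)
# action = ctrl.command(obs); obs, reward, done, info = env.step(action)
```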

LemonPi avatar Dec 25 '19 20:12 LemonPi

Currently, I am not planning to implement this extension anytime soon. Please feel free to do so. A pull-request would be highly appreciated!

ferreirafabio avatar Dec 26 '19 15:12 ferreirafabio

OK, I'll look into it. Let me list what I think the changes need to be (a rough sketch of the resulting control step follows the lists). I plan to make these in a new file to preserve the current file as an example of what to do when you can actually sample from the environment. I might also implement a generic MPPI package afterwards.

easy stuff

  • [x] take in approximate dynamics
  • [x] take in running state cost and terminal state cost functions
  • [x] take in trainer method (it'll concatenate data and retrain the approximate dynamics)
  • [x] move env as a parameter to control method

stuff I'm not sure about

  • [x] resample noise after each control iteration (i.e., after each control command we execute); keeping the same noise might be problematic, so this is probably best exposed as an option
  • [x] caching the action cost term lambda * Sigma^{-1} * noise
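Roughly the control step I have in mind (1-D action for brevity; `dynamics`, `running_cost`, and `terminal_cost` are the functions from the list above, everything else is illustrative). Noise is redrawn on every call, and the lambda * Sigma^{-1} * noise term is the one I'd consider caching:

```python
# Rough sketch of one MPPI control step against an approximate model.
# Not the repo's code; 1-D action kept for brevity.
import numpy as np

def mppi_step(state, U, dynamics, running_cost, terminal_cost,
              lambda_=1.0, sigma=1.0, K=100):
    """state: current state, U: nominal action sequence of shape (T,)."""
    T = U.shape[0]
    noise = np.random.randn(K, T) * sigma      # fresh noise every control iteration
    cost = np.zeros(K)
    for k in range(K):
        x = state
        for t in range(T):
            u = U[t] + noise[k, t]
            x = dynamics(x, u)                 # learned/approximate model, not env.step
            # action cost term from the paper: lambda * u_t^T Sigma^{-1} eps_t
            cost[k] += running_cost(x, u) + lambda_ * U[t] * (1.0 / sigma**2) * noise[k, t]
        cost[k] += terminal_cost(x)
    beta = cost.min()
    w = np.exp(-(cost - beta) / lambda_)
    w /= w.sum()
    U = U + w @ noise                          # importance-weighted update of the plan
    action = U[0]
    U = np.roll(U, -1)
    U[-1] = 0.0                                # shift the horizon for the next step
    return action, U
```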

LemonPi avatar Dec 26 '19 17:12 LemonPi

Something problematic is that the paper assumes a state-dependent running cost, whereas the pendulum gym task has cost theta^2 + 0.1*theta_dot^2 + 0.001*action^2 (https://github.com/openai/gym/wiki/Pendulum-v0). I don't think the formulation changes much if we allow q(x, u) to replace q(x). Actually, since you're using the reward from env, which is q(x, u) anyway, it doesn't change anything.
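For concreteness, a sketch of that cost written as q(x, u) (the state layout [theta, theta_dot] is an assumption here; the angle wrapping mirrors what gym does internally):

```python
# Sketch of the pendulum cost as q(x, u), with the angle wrapped to [-pi, pi].
import numpy as np

def angle_normalize(theta):
    return ((theta + np.pi) % (2 * np.pi)) - np.pi

def running_cost(x, u):
    theta, theta_dot = x  # assumed state layout
    return angle_normalize(theta) ** 2 + 0.1 * theta_dot ** 2 + 0.001 * u ** 2
```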

LemonPi avatar Dec 26 '19 19:12 LemonPi

EDIT3: It works somewhat reliably now after gathering some data (around 500 steps). I had to normalize the angle before feeding it into the network, because the env doesn't normalize it, so I ran into the issue of being in a totally unsupported input domain when the pendulum wrapped around. Even though the MSE from the network can get to ~O(0.0001), it's not able to stabilize around the top indefinitely; maybe you can take a look at why?

[animated GIF of the resulting pendulum behavior]
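A sketch of the kind of angle handling I mean (the state layout and the `model` call are assumptions, not the repo's code); either wrap theta or feed (cos, sin) so the network never sees an out-of-range angle:

```python
# Keep network inputs in a supported domain when theta wraps around:
# wrap the angle, or feed (cos(theta), sin(theta)) instead of raw theta.
import numpy as np

def to_network_input(x):
    theta, theta_dot = x                              # assumed state layout
    theta = ((theta + np.pi) % (2 * np.pi)) - np.pi   # wrap to [-pi, pi]
    return np.array([np.cos(theta), np.sin(theta), theta_dot])

# next_state = model(to_network_input(x), u)  # `model` is the learned dynamics
```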

EDIT2: It sometimes learns to swing up, but doesn't know how to stay up; maybe you can take a look to see whether this is expected behaviour: https://github.com/LemonPi/mppi_pendulum EDIT: fixed the problem of getting stuck on one side (due to unnormalized theta in the cost function). Now it's just struggling to get past the halfway point of the swing-up.

LemonPi avatar Dec 26 '19 20:12 LemonPi

OK, what's really strange is that, using the same model structure and random seed, my PyTorch implementation of generic MPPI (generalized to other problem dimensions) applied to the pendulum problem achieves stability. It's also about 70 times faster (even on the CPU) with 100 samples, thanks to batched sampling:

[GIF: the PyTorch implementation stabilizing the pendulum]
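The speedup comes mostly from rolling out all K samples as one batch per time step instead of looping over them; roughly like this (illustrative only, and `dynamics` is assumed to accept batched inputs):

```python
# Sketch of batched rollouts: all K perturbed trajectories advance together
# as (K, nx) tensors, with one batched model call per time step.
import torch

def rollout_costs(state, U, dynamics, running_cost, noise, lambda_, sigma_inv):
    K, T, nu = noise.shape
    x = state.repeat(K, 1)                     # (K, nx): one copy of the state per sample
    cost = torch.zeros(K)
    for t in range(T):
        u = U[t] + noise[:, t, :]              # (K, nu) perturbed actions
        x = dynamics(x, u)                     # one batched model call per step
        cost += running_cost(x, u)             # expected to return a (K,) tensor
        cost += lambda_ * (U[t] * sigma_inv * noise[:, t, :]).sum(dim=1)
    return cost
```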

There's either a bug in the numpy implementation, or there's some weird interaction with pytorch from how I set it up...

LemonPi avatar Dec 27 '19 03:12 LemonPi