mbrl-lib
[WIP] Add trajectory-based dynamics model
TODO for this WIP PR:
- [x] New PID based / linear feedback agent(s)
- [ ] Make PID accept vector inputs
- [x] Training example
- [ ] Migrate example to colab
- [ ] Add tests
Types of changes
- [ ] Docs change / refactoring / dependency upgrade
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
Motivation and Context / Related issue
I'm collaborating with some folks at Berkeley looking to apply the trajectory-based model to real-world robotics, so I wanted to integrate it into this library to give it more longevity.
The paper is here. Its core contribution is a dynamics model focused on long-term prediction. The parametrization is:
$$ s_{t+1} = f_\theta(s_0, t, \phi),$$
where $\phi$ are the parameters of a closed-form controller (e.g., PID).
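For intuition, here is a minimal sketch of what that parametrization could look like as a PyTorch module; the class name, layer sizes, and input packing are all hypothetical and not part of this PR.

```python
import torch
import torch.nn as nn


class TrajectoryBasedModel(nn.Module):
    """Hypothetical sketch: predict the state at time index t directly from
    the initial state s_0, the time index, and the control parameters phi."""

    def __init__(self, state_dim: int, param_dim: int, hidden_dim: int = 256):
        super().__init__()
        # Input: [s_0, t, phi] concatenated into a single vector.
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1 + param_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, s0: torch.Tensor, t: torch.Tensor, phi: torch.Tensor) -> torch.Tensor:
        # s0: (batch, state_dim), t: (batch, 1), phi: (batch, param_dim)
        return self.net(torch.cat([s0, t, phi], dim=-1))
```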
This potentially relates to #66. I think we will need to modify the replay buffer to (a rough sketch of the extra fields follows this list):
- store the control parameter vector,
- store time indices (which may already be close to supported, given the trajectory formulation).
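As a rough sketch of the extra fields (the names here are hypothetical, not a proposed API):

```python
from typing import NamedTuple

import numpy as np


class TrajectoryTransition(NamedTuple):
    # Hypothetical extension of the usual (obs, action, next_obs, reward, done)
    # transition with the two extra fields mentioned above.
    obs: np.ndarray
    action: np.ndarray
    next_obs: np.ndarray
    reward: float
    done: bool
    control_params: np.ndarray  # phi, e.g. the PID parameters used in this episode
    time_index: int  # step within the episode this transition came from
```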
How Has This Been Tested (if it applies)
I am going to build a notebook to validate and demonstrate it; currently it is a fork of the PETS example. I will iterate on it.
Checklist
- [ ] The documentation is up-to-date with the changes I made.
- [x] I have read the CONTRIBUTING document and completed the CLA (see CONTRIBUTING).
- [ ] All tests passed, and additional code has been covered with new tests.
@luisenp do you know whether, with the current replay buffer trajectory storage, it will be easy at training time to get the "trajectory time index" corresponding to the step of the episode?
The trajectory-based model is trained on the full trajectory and each sub-trajectory of it, so at a minimum I will need a tool to get all sub-trajectories. Happy to discuss on a call if it helps!
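To make the question concrete, this is roughly the kind of helper I have in mind for recovering the per-step time index from a flat buffer; it is only an illustration, not code from this PR.

```python
import numpy as np


def episode_time_indices(dones: np.ndarray) -> np.ndarray:
    # Recover the "trajectory time index" (step within the episode) for each
    # transition in a flat buffer, using the done flags as episode boundaries.
    time_indices = np.zeros(len(dones), dtype=np.int64)
    t = 0
    for i, done in enumerate(dones):
        time_indices[i] = t
        t = 0 if done else t + 1
    return time_indices
```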
Also, I'm not sure what to do with the pre-commit hooks; they're failing, I think because I am using a different version of Python:
[INFO] Installing environment for https://github.com/psf/black.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
An unexpected error has occurred: CalledProcessError: command: ('/Users/nato/miniconda3/envs/mbrl/bin/python', '-mvirtualenv', '/Users/nato/.cache/pre-commit/repor4ja1kkq/py_env-python3.7', '-p', 'python3.7')
return code: 1
expected return code: 0
stdout:
RuntimeError: failed to find interpreter for Builtin discover of python_spec='python3.7'
Though if I change the config, it wants me to stage it 😥
[ERROR] Your pre-commit configuration is unstaged.
`git add .pre-commit-config.yaml` to fix this.
@robertocalandra @albertwilcox may be interested.
Hi @natolambert, thanks for the PR! I have a few questions:
- I'm wondering if having a new replay buffer class for this is overkill; couldn't we just store the control parameter and time index as part of the state in a regular replay buffer?
- By getting "all subtrajectories", are you referring to all `N * (N + 1) / 2` subtrajectories of a trajectory of length `N`? My first thought is that the cleanest option would be to create a new iterator class that gets the full trajectories from the replay buffer and then serves the subtrajectories in its loop (rough sketch after this list).
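A rough sketch of that iterator idea, assuming full trajectories have already been pulled from the replay buffer (names and shapes are illustrative, not the library's API):

```python
import numpy as np


class SubtrajectoryIterator:
    """Serve a (start state, horizon, target state) training pair for every
    subtrajectory of length >= 2 contained in a set of full trajectories."""

    def __init__(self, trajectories):
        # trajectories: list of arrays, each of shape (episode_length, state_dim)
        self.trajectories = trajectories

    def __iter__(self):
        for traj in self.trajectories:
            n = len(traj)
            for start in range(n):
                for horizon in range(1, n - start):
                    # Input: state at `start` and the prediction horizon;
                    # target: the state `horizon` steps later.
                    yield traj[start], horizon, traj[start + horizon]
```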
Also, can you try checking out the branch `lep.change_precommit_yaml` and let me know if it solves the pre-commit issues you are having?
I really like the sound of that; let me try to make those additions. It removes complexity in a way that I think is fitting.
This seems like a good direction. I'm realizing I will need to add another class similar to `OneDTransitionRewardModel`, because the prediction formulation is as follows (rather than one step):
$$ s_{t+h} = f_\theta(s_t, h, \psi) $$
Not sure if I need a new MLP model; if I do, it will just be for different trajectory propagation options (because prediction is no longer recursive, but rather all elements come from one forward pass).
Will push more changes shortly. Relatedly, why is it called `OneDTransitionRewardModel`? That has always confused me.
I guess maybe we can spoof it as follows (sketch below): the inputs are state = [state, time index] and action = control parameters; the output is the state at the future time index.
With the right propagation function, nothing needs to change, including the replay buffer. We'll see if I can get it to work.
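A minimal sketch of that spoofing, with hypothetical names; the point is just that a standard one-step wrapper would then see ordinary (obs, act, next_obs) triples.

```python
import numpy as np


def make_spoofed_batch(
    s0: np.ndarray,  # (batch, state_dim) initial states
    control_params: np.ndarray,  # (batch, param_dim) e.g. PID parameters
    horizons: np.ndarray,  # (batch,) integer time indices to predict at
    targets: np.ndarray,  # (batch, state_dim) states at those time indices
):
    # obs = [initial state, time index]
    obs = np.concatenate([s0, horizons[:, None].astype(s0.dtype)], axis=-1)
    # act = control parameters
    act = control_params
    # next_obs = state at the future time index
    next_obs = targets
    return obs, act, next_obs
```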
TBH, I don't love the name `OneDTransitionRewardModel` either. A previous version was called `ProprioceptiveModel`, but an earlier user pointed out that in some cases the same type of model could be used for inputs that wouldn't be considered proprioceptive. We brainstormed a bit on what a general name for this model would be, and this was the winning option. It essentially refers to the fact that the input to this model is a 1D array with `[states, actions]` and the output is a 1D array with `[next_states, rewards]`.
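For concreteness, a tiny illustration of that layout (the dimensions below are arbitrary examples):

```python
import numpy as np

state, action = np.zeros(17), np.zeros(6)
next_state, reward = np.zeros(17), np.zeros(1)

model_input = np.concatenate([state, action])  # 1D array [states, actions], shape (23,)
model_target = np.concatenate([next_state, reward])  # 1D array [next_states, rewards], shape (18,)
```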
Your feedback is making me think I may not actually need any of the new code I proposed other than some agent definitions for the environment I want to use.
Hacking all the input-output pairs to be correct has gotten pretty far. Let me finish this, and then we can see if we want to add more (which would make it clearer for people who are interested). The clear difference from the `GaussianMLP` class is trajectory propagation.
Got "loss going down" which is always an exciting point. I will add more validation soon.
@luisenp I got the initial notebook done. I've actually set it up so I can reset the rest of the files and just merge the notebook when we are happy with it. I think the difference between the MBRL-Lib reacher env and the one used in the paper is causing some numerical instability, but the general principle is close to being validated!
Here's a long-term prediction accuracy comparison between the one-step model and the traj-based model. I'm not 100% sure I'm running prediction with the traj-based model correctly, because I couldn't use the wrappers. It predicts by taking an initial state, control parameters, and a vector of time indices.
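Roughly, the prediction step looks like the sketch below, assuming the model takes the concatenated [s_0, t, phi] input; this is how I'm doing it outside the wrappers, so treat it as illustrative.

```python
import torch


def predict_trajectory(traj_model, s0: torch.Tensor, phi: torch.Tensor, horizon: int) -> torch.Tensor:
    # Tile the initial state and control parameters, pass a vector of time
    # indices, and get every future state from a single forward pass.
    t = torch.arange(1, horizon + 1, dtype=s0.dtype).unsqueeze(-1)  # (horizon, 1)
    s0_tiled = s0.unsqueeze(0).expand(horizon, -1)  # (horizon, state_dim)
    phi_tiled = phi.unsqueeze(0).expand(horizon, -1)  # (horizon, param_dim)
    model_input = torch.cat([s0_tiled, t, phi_tiled], dim=-1)
    with torch.no_grad():
        return traj_model(model_input)  # (horizon, state_dim) predicted states
```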
Let me know what you think when you look at the structure of the notebook!
Some things I want to get to:
- set up batched inference so I can use ensembles (any tips on this?),
- try a different environment,
- see if more data helps numerical stability.
Here's the plot comparing long-term prediction accuracy.
That's a nice plot! I'll take a closer look at this soon (hopefully tomorrow Friday, if I have time). Thanks!
@luisenp I guess the most relevant question is whether any of the code in the works here interests you for the library, or whether we should just go notebook-only.
I ended up removing most of it. I can add a commit that resets everything except the notebook. I re-used the vanilla pieces in a way that could be slightly confusing; if people turn out to be interested in this, we could add official code that makes it much simpler to use. Some of the things I'm using in a potentially non-intended manner:
- trajectory storing in an extra replay buffer because trajectory-based model trains on sub-trajectories,
- putting control parameters + time horizon into actions (the naming discrepancy could be confusing for people),
- had to deconstruct `util.rollout_agent_trajectories` to store the data I needed (even though I just wanted one replay buffer).
I think that until the model gets more traction, notebook-only is probably okay (I will reset all the other changes in the PR). It shows a good research use case for how the library can be used flexibly.
If everything you need to run this example is in the notebook, then that's definitely a good starting point! In that case I can focus on reviewing the notebook more carefully. We don't need to remove the other files yet, maybe some of them will be useful later (thinking of the PID agent here).
That said, it would be useful if you could remove any superfluous stuff from the notebook, and maybe add some short text highlighting the parts I should focus on when reviewing?
Yeah, will do. In that case I will do a full pass to clean things up (including addressing the issues in the PID agent and making it more compatible with the style guidelines / pre-commit hooks).
@luisenp added tests, made everything batch-friendly, and cleaned things up.
After our discussion today, I think I am happy with this as a compromise:
- transfer the notebook to Colab, and link to the Colab in the README plus a link to the source in my mbrl-lib fork (after minor improvements);
- make this PR just for PID control.
If it gets solid uptake, we can polish it further. Thoughts?
Hi Nathan, that sounds good, I'm OK with this plan.
Colab is here: https://colab.research.google.com/drive/15lodC9KyzzQCv9hQY3wtAe-yYOdk9vZB?usp=sharing
My dev repo is here: https://github.com/natolambert/mbrl-lib-dev/tree/traj-model (will merge into my main branch when we're done here)
Any thoughts on adding a vanilla reward function, like `no_termination` (for unrolling states where you don't care about reward, or for reacher, where there is no reward function right now)? For example:
def no_reward(act: torch.Tensor, next_obs: torch.Tensor) -> torch.Tensor:
    # Return a zero reward for every transition in the batch.
    return torch.zeros(next_obs.shape[0], 1, device=next_obs.device)
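Assuming the usual convention of one reward value per transition in the batch, a quick shape check would look like:

```python
import torch

act = torch.zeros(32, 6)
next_obs = torch.zeros(32, 17)
rewards = no_reward(act, next_obs)
print(rewards.shape)  # torch.Size([32, 1])
```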