imitation
# Support simple, synchronous CLI or Jupyter human preference collection
## Problem
Today only synthetic preferences are supported. It would be great to support real human preferences.
## Solution
Requirements:
- record videos of trajectories
- ideally, extensible so we could factor out into a more elaborate, asynchronous service someday
- simple enough that it can be built in a day
MVP:
- In a Jupyter notebook or CLI
- Synchronous
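The "extensible" requirement above (local directories now, a cloud or async service someday) suggests putting video storage behind a small interface. A minimal sketch — all class and method names here are hypothetical, not part of imitation's API:

```python
import shutil
from abc import ABC, abstractmethod
from pathlib import Path


class VideoStore(ABC):
    """Where trajectory videos live; swap implementations for local vs. cloud."""

    @abstractmethod
    def save(self, video_path: Path, key: str) -> str:
        """Store the file under `key`; return a locator usable for display."""

    @abstractmethod
    def delete(self, key: str) -> None:
        """Remove a watched/unneeded video."""


class LocalVideoStore(VideoStore):
    """Simplest implementation: copy videos into a local directory."""

    def __init__(self, root: Path):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, video_path: Path, key: str) -> str:
        dest = self.root / f"{key}.mp4"
        shutil.copy(video_path, dest)
        return str(dest)

    def delete(self, key: str) -> None:
        # missing_ok means cleanup is idempotent
        (self.root / f"{key}.mp4").unlink(missing_ok=True)
```

A cloud-backed implementation would only need to provide the same two methods, which is what would let this factor out into a separate service later.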
Steps:
- [ ] Inject option to store videos into training code (looks like I can just use VecEnv?)
- [ ] Build new interface/class for storing videos (so we could configure this to store in different directories or on the cloud, for example)
- [ ] Build new gatherer that requests user feedback
- [ ] Display videos to users in Jupyter or CLI
- [ ] Clean up watched/unneeded videos
- [ ] Build demo notebook
- [ ] Integrate into training script and test end to end
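The gatherer and display steps above could look roughly like the following synchronous loop. This is a sketch, not imitation's actual gatherer interface; the display and input callables are injected so the same function works in a CLI (`print`/`input`), in Jupyter (e.g. an `IPython.display` wrapper), and in tests:

```python
from typing import Callable, List, Sequence, Tuple


def gather_cli_preferences(
    video_pairs: Sequence[Tuple[str, str]],
    display: Callable[[str], None] = print,  # swap for a video widget in Jupyter
    ask: Callable[[str], str] = input,       # injected so tests can fake the user
) -> List[float]:
    """Synchronously ask a human which trajectory of each pair is better.

    Returns one preference per pair: 1.0 if the first video is preferred,
    0.0 if the second, 0.5 for "can't tell".
    """
    prefs: List[float] = []
    for first, second in video_pairs:
        display(f"Video A: {first}")
        display(f"Video B: {second}")
        while True:
            answer = ask("Prefer A, B, or same? [a/b/s] ").strip().lower()
            if answer in ("a", "b", "s"):
                break
            display("Please answer a, b, or s.")
        prefs.append({"a": 1.0, "b": 0.0, "s": 0.5}[answer])
    return prefs
```

Returning a probability-style value per pair (rather than a hard label) should make it easy to slot in wherever the synthetic gatherer's output currently goes.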
## Possible alternative solutions
Slightly more than MVP (out of scope for this issue):
- Refactor to support asynchronous preference gathering
  - Separate requesting preferences from receiving them
  - Periodically retrain with new preferences
    - Would be nice to have a way to indicate that new preferences are available
    - Might require changing the fragmenter, e.g. showing only the most recent pair to the user rather than all pairs
- Build as an asynchronous service
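Separating requesting preferences from receiving them could be as simple as two queues between the training loop and a human-facing worker. A minimal sketch with hypothetical names, to show the shape of the interface rather than a real design:

```python
import queue
from typing import Dict, Optional, Tuple


class AsyncPreferenceGatherer:
    """Decouples requesting preferences from receiving them.

    Training pushes fragment pairs into `request`; a human-facing worker
    (CLI, notebook, or eventually a web service) pulls them with
    `next_request` and answers via `submit`; training periodically drains
    whatever has arrived with `pop_new_preferences`.
    """

    def __init__(self) -> None:
        self._pending: "queue.Queue[Tuple[int, Tuple[str, str]]]" = queue.Queue()
        self._answered: "queue.Queue[Tuple[int, float]]" = queue.Queue()

    def request(self, pair_id: int, pair: Tuple[str, str]) -> None:
        self._pending.put((pair_id, pair))

    def next_request(self, timeout: Optional[float] = None) -> Tuple[int, Tuple[str, str]]:
        return self._pending.get(timeout=timeout)

    def submit(self, pair_id: int, preference: float) -> None:
        self._answered.put((pair_id, preference))

    def pop_new_preferences(self) -> Dict[int, float]:
        """Drain answered preferences without blocking; empty dict if none yet."""
        out: Dict[int, float] = {}
        while True:
            try:
                pair_id, pref = self._answered.get_nowait()
            except queue.Empty:
                return out
            out[pair_id] = pref
```

`pop_new_preferences` returning an empty dict doubles as the "are new preferences available?" signal mentioned above, and the `pair_id` keys are what would let a modified fragmenter show only the most recent pair.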