
Support asynchronous human preference gathering in RLHP implementation

Open mschweizer opened this issue 2 years ago • 4 comments

Description

  • We implemented asynchronous human preference gathering for RLHP, which includes an API change. See #696 for more details.
  • Our PR implements a REST interface for general web UIs, and we provide a simple Django web service that already works with it (a rough sketch of the resulting asynchronous flow follows after this list).
  • This PR also integrates synchronous human preference gathering via CLI/Notebook from PR #712, which is fully compatible with our implementation.
    • One difference from #712 is that we generate videos only for the requested fragments instead of for the entire environment interaction.
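
For a rough idea of what the REST-backed asynchronous flow could look like, here is a minimal sketch. The base URL, endpoint paths, and payload fields are assumptions made for illustration; they are not the actual API of this PR or of the Django service:

```python
# Minimal sketch of an asynchronous, REST-backed preference flow.
# NOTE: base URL, endpoint paths, and payload fields are illustrative
# assumptions, not the actual API of this PR or of the Django service.
import requests

BASE_URL = "http://localhost:8000"  # e.g. a local preference-collection web UI


def post_query(query_id: str, left_video_url: str, right_video_url: str) -> None:
    """Register a pairwise comparison query so a human can answer it later."""
    resp = requests.post(
        f"{BASE_URL}/queries/",
        json={"id": query_id, "left": left_video_url, "right": right_video_url},
        timeout=10,
    )
    resp.raise_for_status()


def poll_answers(pending_ids: set) -> dict:
    """Fetch answered queries without blocking on the ones still unresolved."""
    answered = {}
    for query_id in list(pending_ids):
        resp = requests.get(f"{BASE_URL}/queries/{query_id}/", timeout=10)
        resp.raise_for_status()
        label = resp.json().get("label")
        if label is not None:  # None: the human has not answered this query yet
            answered[query_id] = label
            pending_ids.discard(query_id)
    return answered
```

In such a flow, training can keep running and simply merge whatever poll_answers() returns into the preference dataset on each iteration.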

Testing

We added extensive unit tests in test_preference_comparisons.py. Tests for two utility methods are currently missing, and some existing tests still need fixing.

mschweizer · May 15 '23 21:05

@ernestum just letting you know that this is ready for first review. Also, we think that this might be a breaking change due to the changed signature of the "gatherer" and the introduction of the "querent".

rk1a · Jun 25 '23 21:06

@rk1a thanks for this huge chunk of work! We are excited to see how we can upstream it.

Right now we are focused on releasing a v1.0 of imitation by the end of July. We plan to introduce the asynchronous human preference gathering after that release. So please bear with us, because we do not have the capacity to review in depth and integrate such a big change right now, but be assured that we have it on our radar.

I had a quick glance over the changes.

I think changing the signature of the gatherer is not a breaking change since it is just an internal class anyway. Requiring an explicit querent to construct the PreferenceComparisons class, however, is a breaking change. Maybe we can adapt it so that a default querent is created when none is provided?
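
Something along these lines would keep existing call sites working; the class and parameter names below are placeholders, not the signatures from the PR:

```python
# Sketch of the "create a default querent when none is given" idea.
# Class and parameter names are placeholders, not the signatures used in the PR.
from typing import Optional, Sequence, Tuple


class PreferenceQuerent:
    """Dispatches pairwise comparison queries (here: a trivial no-op default)."""

    def query(self, fragment_pairs: Sequence[Tuple[object, object]]) -> None:
        pass  # a real querent would e.g. render videos and notify the human


class PreferenceComparisons:
    def __init__(self, querent: Optional[PreferenceQuerent] = None) -> None:
        # Keep old call sites working: fall back to a default querent when none
        # is passed in, so adding the parameter is not a breaking change.
        self.querent = querent if querent is not None else PreferenceQuerent()
```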

To me it seems you split the gatherer up into a gatherer and a querent. I am not 100% sure that I understand the boundaries of responsibility between the two classes, especially when I just read the docstrings:

preference_querent: queries preferences between trajectory fragments.
preference_gatherer: gathers preferences between trajectory fragments.

it is not really clear to me.

I am not sure if you observed this in your experiments, but in the long run we probably need to introduce a mechanism to prevent the RL algorithm from training for too long without any fresh human preferences. Otherwise, there is the danger of model "burn in": the policy overfits to the learned reward model and starts to cheat.

@Rocamonde will have a deeper look at your PR to give you some preliminary but more in-depth feedback in the coming days.

ernestum · Jul 04 '23 09:07

Hi Rocamonde,

Thanks for your feedback.

  1. Naming

    • We are happy to change the name “Querent” to something more common, like inquirer.
    • PrefCollect is the name of the service that we have used as the web GUI for preference collection. We plan to make this part more generic and accordingly remove the reference to that particular service's name.
  2. Split into Gatherer and Querent

    • We will change our design and integrate the querent functionality into the gatherer.
    • We will follow up with details as soon as we implement this change.
  3. Asynchronous Preference Collection and Controllable Learning

    • We acknowledge that this asynchronous design results in a less controlled learning process.
    • We believe that there is no solution that allows a completely controlled learning process and fast training.
    • In our opinion, the trade-off between these two conflicting objectives should be made by the user. For this, we would like to let the user choose the degree of control based, for example, on the percentage of queries that are unresolved. The range would go from completely controlled, i.e. training is blocked if any unresolved queries exist, to completely uncontrolled, i.e. training proceeds irrespective of the number of unresolved queries (see the sketch after this list).
    • Our approach can be extended to include this functionality.
    • As a side note, the implementation of Christiano et al. was asynchronous, with parallel, non-blocking processes for RL training, preference collection, and reward model training.
  4. Agreed :-)
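
A small sketch of what such a user-configurable gate (point 3 above) could look like; the function and parameter names are made up for illustration and are not part of the PR:

```python
# Illustrative sketch of a configurable "degree of control" gate; the parameter
# name and bookkeeping are assumptions, not part of the PR.
def rl_training_allowed(
    num_unresolved_queries: int,
    num_total_queries: int,
    max_unresolved_fraction: float,
) -> bool:
    """Decide whether RL training may proceed given outstanding human queries.

    max_unresolved_fraction = 0.0 blocks training while any query is unresolved
    (fully controlled); 1.0 never blocks training (fully uncontrolled).
    """
    if num_total_queries == 0:
        return True
    unresolved_fraction = num_unresolved_queries / num_total_queries
    return unresolved_fraction <= max_unresolved_fraction
```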

mschweizer · Aug 04 '23 14:08

Hi @Rocamonde
we have changed our design and integrated the querent functionality into the gatherer. We added the two methods gather() and query() to the gatherer and removed its __call__() method; gather() now fulfills the role of __call__(). We made the querent a component of the gatherer, and query() calls its querent component's query method. Currently, the implemented gatherer classes use a hardcoded querent component. We are aware that there are alternative ways of designing this:

  • inject an object or callable for query(); injecting an object is similar to our new design, because our design could be changed to accept the querent component as a constructor argument (with a default value)
  • use multiple inheritance for query() and possibly gather()
  • implement gather() and query() directly in each gatherer class (and use inheritance where it is needed)

Side note: we waited with the renaming of the
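
For concreteness, here is a condensed sketch of the composition variant described above, with the querent optionally injected through the constructor; the class names and the default behaviour are illustrative, not the exact code in the PR:

```python
# Condensed sketch of the gatherer/querent composition described above; class
# names and the default querent are illustrative, not the exact code in the PR.
from typing import Optional, Sequence, Tuple

import numpy as np


class PreferenceQuerent:
    def query(self, fragment_pairs: Sequence[Tuple[object, object]]) -> None:
        """Send the given fragment pairs to the human (no-op placeholder)."""


class PreferenceGatherer:
    def __init__(self, querent: Optional[PreferenceQuerent] = None) -> None:
        # The querent is a component of the gatherer; injecting it here (with a
        # default) is the variant that stays closest to the current design.
        self.querent = querent if querent is not None else PreferenceQuerent()

    def query(self, fragment_pairs: Sequence[Tuple[object, object]]) -> None:
        # Delegate to the querent component.
        self.querent.query(fragment_pairs)

    def gather(self) -> np.ndarray:
        """Collect the preferences answered so far (replaces __call__())."""
        raise NotImplementedError
```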

We would like to get your input on our new design and these alternative options. What would be the advantages/disadvantages given a future refactoring of the algorithm?

mschweizer · Nov 06 '23 15:11