[Preference Comparison] Active learning from ensemble
The original deep RL from human preferences paper (Christiano et al., 2017) trains an ensemble of reward models and selects comparison queries on which the ensemble members disagree most, using disagreement as a proxy for epistemic uncertainty. Specifically, section 2.2.4 states:
We decide how to query preferences based on an approximation to the uncertainty in the reward function estimator, similar to Daniel et al. (2014): we sample a large number of pairs of trajectory segments of length k, use each reward predictor in our ensemble to predict which segment will be preferred from each pair, and then select those trajectories for which the predictions have the highest variance across ensemble members.
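To make the selection rule concrete, here is a minimal sketch (illustrative names, not code from the paper or from imitation), assuming we already have each ensemble member's predicted probability that the first fragment of each candidate pair is preferred:

```python
import numpy as np

def select_most_uncertain_pairs(
    member_probs: np.ndarray,  # shape (n_members, n_candidates): P(first fragment preferred)
    num_pairs: int,            # how many pairs to actually send for human comparison
) -> np.ndarray:
    """Return indices of the num_pairs candidates with the highest
    variance of predicted preference across ensemble members."""
    variance = member_probs.var(axis=0)       # disagreement per candidate pair
    return np.argsort(-variance)[:num_pairs]  # highest-variance candidates first

# Toy usage: 3 ensemble members, 5 candidate pairs, keep the 2 most contested.
probs = np.array([
    [0.9, 0.5, 0.1, 0.6, 0.5],
    [0.1, 0.5, 0.2, 0.4, 0.5],
    [0.5, 0.5, 0.1, 0.5, 0.5],
])
print(select_most_uncertain_pairs(probs, num_pairs=2))  # prints [0 3]
```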
The ablation in section 3.3 (random queries) shows active learning significantly improves results in some environments (e.g. swimmer, qbert). However, it makes little difference in most environments, and even hurts performance in a handful (e.g. breakout). Nonetheless, it seems worth supporting: it often improves performance and is a natural starting point for more sophisticated active learning approaches.
https://github.com/HumanCompatibleAI/imitation/pull/460 adds support for reward ensembles. Given this, I think the remaining work is "just" to use the ensemble's disagreement to select queries during preference comparison. The natural way to add this seems to be a new Fragmenter that performs active selection, rather than the current RandomFragmenter, which samples fragment pairs uniformly at random.
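A hypothetical sketch of what such a fragmenter could look like (illustrative interface, not the actual imitation API; the real version would subclass the existing Fragmenter base class and could reuse RandomFragmenter to generate candidates): oversample candidate pairs, score each pair by ensemble disagreement on the predicted preference, and keep the top num_pairs.

```python
from typing import Callable, Sequence, Tuple
import numpy as np

Fragment = np.ndarray          # stand-in for a trajectory fragment
FragmentPair = Tuple[Fragment, Fragment]

class ActiveSelectionFragmenter:
    def __init__(
        self,
        base_fragmenter: Callable[..., Sequence[FragmentPair]],  # e.g. a uniform random fragmenter
        ensemble_returns: Callable[[Fragment], np.ndarray],      # per-member predicted return of a fragment
        oversampling_factor: int = 10,
    ):
        self.base_fragmenter = base_fragmenter
        self.ensemble_returns = ensemble_returns
        self.oversampling_factor = oversampling_factor

    def __call__(self, trajectories, fragment_length: int, num_pairs: int) -> Sequence[FragmentPair]:
        # Draw many more candidate pairs than we intend to have labelled...
        candidates = self.base_fragmenter(
            trajectories, fragment_length, num_pairs * self.oversampling_factor
        )
        # ...score each candidate by the variance, across ensemble members, of the
        # predicted probability that the first fragment is preferred...
        variances = []
        for frag_a, frag_b in candidates:
            returns_a = self.ensemble_returns(frag_a)  # shape (n_members,)
            returns_b = self.ensemble_returns(frag_b)
            probs = 1.0 / (1.0 + np.exp(returns_b - returns_a))  # Bradley-Terry model per member
            variances.append(probs.var())
        # ...and keep only the most contested pairs for human labelling.
        top = np.argsort(-np.asarray(variances))[:num_pairs]
        return [candidates[i] for i in top]
```

The oversampling factor trades off query quality against the cost of scoring candidates; the selection itself is cheap relative to gathering human comparisons.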