
Implemented the ability to train reward models in preference comparison against multiple policies

Open · levmckinney opened this pull request 2 years ago • 6 comments

Description

This pull request adds the MixtureOfTrajectoryGenerators class, which allows preference comparison to be trained on data from multiple sources. This includes training against multiple agents, or against a mixture of data from an agent and a fixed dataset.
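
For context, a mixture generator along these lines might look roughly as follows. This is a minimal sketch, not the PR's actual implementation: it assumes each member exposes `sample(steps)` and `train(steps)` methods, as imitation's `TrajectoryGenerator` interface does, and the even split of timesteps across members is purely illustrative.

```python
from typing import Sequence


class MixtureOfTrajectoryGenerators:
    """Samples trajectories from several member generators.

    Sketch only: assumes each member exposes ``sample(steps)`` and
    ``train(steps)``, like imitation's ``TrajectoryGenerator``.
    """

    def __init__(self, members: Sequence):
        if not members:
            raise ValueError("Need at least one trajectory generator.")
        self.members = list(members)

    def sample(self, steps: int) -> list:
        # Split the requested number of timesteps roughly evenly across
        # members, then concatenate whatever each one returns.
        quota = -(-steps // len(self.members))  # ceiling division
        trajectories = []
        for member in self.members:
            trajectories.extend(member.sample(quota))
        return trajectories

    def train(self, steps: int, **kwargs) -> None:
        # Delegate training to every member; a fixed-dataset source can
        # implement train() as a no-op.
        for member in self.members:
            member.train(steps, **kwargs)
```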

To facilitate this, I have modified the HierarchicalLogger class to allow arbitrary prefixes to be added, so that different trajectory generators' logs don't clobber each other.
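
To see why prefixes matter: if two generators both record a key such as `reward_mean`, the later write silently overwrites the earlier one unless the keys are namespaced per generator. Here is a toy illustration of prefix-scoped logging; the `PrefixedLogger` class and its method names are hypothetical, not the actual HierarchicalLogger API.

```python
import contextlib
from typing import Dict, List


class PrefixedLogger:
    """Toy prefix-scoped logger (hypothetical API, for illustration only)."""

    def __init__(self) -> None:
        self._prefixes: List[str] = []
        self.records: Dict[str, float] = {}

    @contextlib.contextmanager
    def prefix(self, name: str):
        # Keys recorded inside this block get "<name>/" prepended, so two
        # generators logging the same key land under distinct paths.
        self._prefixes.append(name)
        try:
            yield self
        finally:
            self._prefixes.pop()

    def record(self, key: str, value: float) -> None:
        self.records["/".join(self._prefixes + [key])] = value


logger = PrefixedLogger()
with logger.prefix("gen0"):
    logger.record("reward_mean", 1.0)  # stored as "gen0/reward_mean"
with logger.prefix("gen1"):
    logger.record("reward_mean", 2.0)  # stored as "gen1/reward_mean"
```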

Testing

Changes to the HierarchicalLogger class have been unit tested, and MixtureOfTrajectoryGenerators has also been unit tested. I've also included new integration tests for the train_preference_comparisons.py script to cover the use case of having multiple policies.
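
As a rough illustration of the kind of unit test this implies (hypothetical, written against the MixtureOfTrajectoryGenerators sketch above rather than the PR's real code):

```python
def test_mixture_samples_from_all_members():
    class FakeGenerator:
        def __init__(self, tag: str):
            self.tag = tag

        def sample(self, steps: int):
            # Return one dummy "trajectory" labelled by its source.
            return [self.tag]

        def train(self, steps: int, **kwargs) -> None:
            pass

    mixture = MixtureOfTrajectoryGenerators(
        [FakeGenerator("a"), FakeGenerator("b")]
    )
    assert set(mixture.sample(10)) == {"a", "b"}
```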

levmckinney · Aug 17 '22 05:08

Codecov Report

Merging #529 (408e248) into master (91c66b7) will increase coverage by 0.03%. The diff coverage is 99.29%.

⚠️ Current head 408e248 differs from the pull request's most recent head 4df8551. Consider uploading reports for commit 4df8551 to get more accurate results.

@@            Coverage Diff             @@
##           master     #529      +/-   ##
==========================================
+ Coverage   97.16%   97.19%   +0.03%     
==========================================
  Files          85       85              
  Lines        7646     7769     +123     
==========================================
+ Hits         7429     7551     +122     
- Misses        217      218       +1     
| Impacted Files | Coverage Δ |
| --- | --- |
| src/imitation/algorithms/preference_comparisons.py | 99.04% <97.82%> (-0.13%) ↓ |
| ...ion/scripts/config/train_preference_comparisons.py | 85.71% <100.00%> (+0.38%) ↑ |
| .../imitation/scripts/train_preference_comparisons.py | 97.22% <100.00%> (+0.50%) ↑ |
| src/imitation/util/logger.py | 100.00% <100.00%> (ø) |
| tests/algorithms/test_preference_comparisons.py | 100.00% <100.00%> (ø) |
| tests/scripts/test_scripts.py | 100.00% <100.00%> (ø) |
| tests/util/test_logger.py | 100.00% <100.00%> (ø) |


codecov[bot] · Aug 18 '22 21:08

@levmckinney let me know if there are any particular areas you'd like my feedback on. I'll aim to do a high-level review, but may not have time to look at it in detail.

@Rocamonde can you do a style/technical review on Monday?

AdamGleave · Aug 21 '22 03:08

When reading through the logging code, I see we are maintaining a somewhat advanced and general logging infrastructure for hierarchical logging. I'm surprised there isn't already an existing way to handle this without a bespoke implementation. What are other people doing (i.e. in similar ML projects)? Are our logging needs very specific and different to the way this is normally done? It might be interesting to spin off the logging module into e.g. an SB3 PR, so people don't have to reinvent the wheel every time.

Rocamonde · Aug 22 '22 13:08

> When reading through the logging code, I see we are maintaining a somewhat advanced and general logging infrastructure for hierarchical logging. I'm surprised there isn't already an existing way to handle this without a bespoke implementation. What are other people doing (i.e. in similar ML projects)? Are our logging needs very specific and different to the way this is normally done? It might be interesting to spin off the logging module into e.g. an SB3 PR, so people don't have to reinvent the wheel every time.

What we are doing is kind of weird -- you don't normally have several RL trainers that all need to log to the same place -- but it could come up in some other applications, e.g. population-based training. You could look at how rllib handles this.

AdamGleave · Aug 23 '22 03:08

> @levmckinney let me know if there are any particular areas you'd like my feedback on. I'll aim to do a high-level review, but may not have time to look at it in detail.

@AdamGleave Could you look at the changes to the HierarchicalLogger?

levmckinney · Aug 24 '22 06:08

@Rocamonde I think Lev requested re-review, could you take another look soon? Thanks!

AdamGleave · Sep 02 '22 05:09

What's the status of this? I partially borrowed from this PR in other PRs that have since been merged, to avoid blockage / reinventing the wheel, so it probably requires some rewriting.

Rocamonde · Oct 25 '22 16:10

@levmckinney, what's outstanding here? I know you're busy now, but I can likely find someone else to finish the PR off.

However, IIRC the policy ensemble didn't help performance much, so perhaps it's served its purpose (of running those experiments) and we should just cherry-pick any helpful features from it then close this PR.

AdamGleave · Oct 25 '22 19:10

> @levmckinney, what's outstanding here? I know you're busy now, but I can likely find someone else to finish the PR off.
>
> However, IIRC the policy ensemble didn't help performance much, so perhaps it's served its purpose (of running those experiments) and we should just cherry-pick any helpful features from it then close this PR.

I think we can close this. The main thing we might want to cherry-pick is the changes to the hierarchical logger. However, that only really matters if we plan to move forward with this logging framework.

levmckinney · Nov 18 '22 02:11