Building blocks for PEBBLE
Description
Creates an entropy-reward replay-buffer wrapper to support the unsupervised, state-entropy-based pre-training of an agent described in the PEBBLE paper: https://sites.google.com/view/icml21pebble
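For context, PEBBLE's pre-training reward is a particle-based state-entropy estimate: a state is rewarded in proportion to the log distance to its k-th nearest neighbor among states sampled from the replay buffer. Below is a minimal, illustrative sketch; the function name, the sampling of `buffer_sample`, and the `+ 1.0` offset are assumptions for illustration, not the wrapper added in this PR.

```python
import torch


def knn_entropy_reward(
    states: torch.Tensor, buffer_sample: torch.Tensor, k: int = 5
) -> torch.Tensor:
    """Particle-based entropy reward: log distance to the k-th nearest neighbor.

    `states` are the observations to reward; `buffer_sample` is a batch of
    observations drawn from the replay buffer. Illustrative only.
    """
    # Pairwise Euclidean distances between the queried states and the buffer sample.
    dists = torch.cdist(states.flatten(1), buffer_sample.flatten(1))
    # Distance from each state to its k-th nearest neighbor in the sample.
    knn_dists = torch.kthvalue(dists, k, dim=1).values
    # The reward grows with how far a state is from its neighbors, i.e. how novel it is.
    return torch.log(knn_dists + 1.0)
```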
Testing
Added unit tests.
Thanks for the implementations!
Codecov Report
Merging #625 (b344cbd) into master (1dd4c8f) will increase coverage by 0.10%. The diff coverage is 100.00%.
```diff
@@            Coverage Diff             @@
##           master     #625      +/-   ##
==========================================
+ Coverage   97.51%   97.62%   +0.10%
==========================================
  Files          85       88       +3
  Lines        8316     8698     +382
==========================================
+ Hits         8109     8491     +382
  Misses        207      207
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/imitation/policies/base.py | 100.00% <ø> (ø) | |
| src/imitation/algorithms/pebble/entropy_reward.py | 100.00% <100.00%> (ø) | |
| src/imitation/algorithms/preference_comparisons.py | 99.18% <100.00%> (+0.04%) | :arrow_up: |
| src/imitation/policies/replay_buffer_wrapper.py | 100.00% <100.00%> (ø) | |
| src/imitation/scripts/common/rl.py | 97.40% <100.00%> (-0.04%) | :arrow_down: |
| ...ion/scripts/config/train_preference_comparisons.py | 88.29% <100.00%> (+2.96%) | :arrow_up: |
| .../imitation/scripts/train_preference_comparisons.py | 97.87% <100.00%> (+0.99%) | :arrow_up: |
| src/imitation/util/networks.py | 97.08% <100.00%> (+0.04%) | :arrow_up: |
| src/imitation/util/util.py | 99.13% <100.00%> (+0.14%) | :arrow_up: |
| tests/algorithms/pebble/test_entropy_reward.py | 100.00% <100.00%> (ø) | |
| ... and 5 more | | |
@AdamGleave: replying to your comments here together:
> I'd prefer wrapping it with a NormalizedRewardNet; they're conceptually doing very different things, and we might want to use different normalization schemes (RunningNorm often works worse than EMANorm).
Ok, it required a larger refactor, but you can see how it looks in the last couple of commits.
Conveniently, this change also addresses your other comment. It simplifies the entropy reward classes (the entropy reward and the switch away from the pre-training reward are now separate) and allows for more configurability, at the expense of slightly more complicated wiring (in train_preference_comparisons.py).
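For readers following along, here is a rough sketch of what wrapping the pre-training reward with a NormalizedRewardNet can look like. Treat the exact names used here (BasicRewardNet as a stand-in base net, normalize_output_layer, RunningNorm, EMANorm) as assumptions for illustration; the real wiring is in train_preference_comparisons.py in this PR.

```python
import gym

from imitation.rewards.reward_nets import BasicRewardNet, NormalizedRewardNet
from imitation.util import networks

obs_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,))
act_space = gym.spaces.Discrete(2)

# Stand-in for whatever RewardNet produces the raw (entropy) pre-training reward.
base_reward_net = BasicRewardNet(obs_space, act_space)

# Normalization lives in a separate wrapper rather than inside the entropy
# reward itself, so the normalization scheme stays swappable.
normalized_reward_net = NormalizedRewardNet(
    base_reward_net,
    normalize_output_layer=networks.RunningNorm,  # or e.g. networks.EMANorm
)
```

Keeping normalization in the wrapper means the entropy reward class does one thing, and choosing between normalization schemes becomes a configuration decision rather than a code change.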
It also results in two internal behavior changes:
- Previously, the running mean/variance statistics for normalization were updated first and normalization was applied afterwards; now the order is reversed (see the sketch below).
- Previously, reward calculation required a numpy -> torch -> numpy conversion; now it internally converts numpy -> torch -> numpy -> torch -> numpy (because that is what the existing NormalizedRewardNet code does). This only applies during pre-training.
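A tiny illustration of the ordering change in the first bullet; the RunningStats class here is a generic running-statistics sketch (Welford's algorithm), not the actual RunningNorm/EMANorm implementation.

```python
import numpy as np


class RunningStats:
    """Minimal running mean/variance tracker, for illustration only."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def update(self, x: np.ndarray) -> None:
        for value in np.asarray(x, dtype=float).ravel():
            self.count += 1
            delta = value - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (value - self.mean)

    def normalize(self, x: np.ndarray) -> np.ndarray:
        std = np.sqrt(self.m2 / max(self.count, 1)) + 1e-8
        return (np.asarray(x, dtype=float) - self.mean) / std


prev_rewards = np.array([0.5, 1.5, 2.5])
new_rewards = np.array([1.0, 2.0, 3.0])

# Old order: fold the new batch into the statistics first, then normalize it.
old = RunningStats()
old.update(prev_rewards)
old.update(new_rewards)
normalized_old = old.normalize(new_rewards)

# New order (after this PR): normalize using statistics from earlier batches
# only, then update the statistics for future calls.
new = RunningStats()
new.update(prev_rewards)
normalized_new = new.normalize(new_rewards)
new.update(new_rewards)
```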