Building blocks for PEBBLE
Description
Creates an entropy-reward replay-buffer wrapper to support the unsupervised, state-entropy-based pre-training of an agent described in the PEBBLE paper: https://sites.google.com/view/icml21pebble
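For context, PEBBLE's pre-training reward is a particle-based state-entropy estimate: a state is rewarded in proportion to the log distance to its k-th nearest neighbor among states sampled from the replay buffer. Below is a minimal, illustrative sketch; the function name, the sampling of `buffer_sample`, and the `+ 1.0` offset are assumptions for illustration, not the wrapper added in this PR.

```python
import torch


def knn_entropy_reward(
    states: torch.Tensor, buffer_sample: torch.Tensor, k: int = 5
) -> torch.Tensor:
    """Particle-based entropy reward: log distance to the k-th nearest neighbor.

    `states` are the observations to reward; `buffer_sample` is a batch of
    observations drawn from the replay buffer. Illustrative only.
    """
    # Pairwise Euclidean distances between the queried states and the buffer sample.
    dists = torch.cdist(states.flatten(1), buffer_sample.flatten(1))
    # Distance from each state to its k-th nearest neighbor in the sample.
    knn_dists = torch.kthvalue(dists, k, dim=1).values
    # The reward grows with how far a state is from its neighbors, i.e. how novel it is.
    return torch.log(knn_dists + 1.0)
```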
Testing
Added unit tests.
Thanks for the implementations!
Codecov Report
Merging #625 (b344cbd) into master (1dd4c8f) will increase coverage by 0.10%. The diff coverage is 100.00%.
```diff
@@            Coverage Diff             @@
##           master     #625      +/-   ##
==========================================
+ Coverage   97.51%   97.62%   +0.10%
==========================================
  Files          85       88       +3
  Lines        8316     8698     +382
==========================================
+ Hits         8109     8491     +382
  Misses        207      207
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/imitation/policies/base.py | 100.00% <ø> (ø) | |
| src/imitation/algorithms/pebble/entropy_reward.py | 100.00% <100.00%> (ø) | |
| src/imitation/algorithms/preference_comparisons.py | 99.18% <100.00%> (+0.04%) | :arrow_up: |
| src/imitation/policies/replay_buffer_wrapper.py | 100.00% <100.00%> (ø) | |
| src/imitation/scripts/common/rl.py | 97.40% <100.00%> (-0.04%) | :arrow_down: |
| ...ion/scripts/config/train_preference_comparisons.py | 88.29% <100.00%> (+2.96%) | :arrow_up: |
| .../imitation/scripts/train_preference_comparisons.py | 97.87% <100.00%> (+0.99%) | :arrow_up: |
| src/imitation/util/networks.py | 97.08% <100.00%> (+0.04%) | :arrow_up: |
| src/imitation/util/util.py | 99.13% <100.00%> (+0.14%) | :arrow_up: |
| tests/algorithms/pebble/test_entropy_reward.py | 100.00% <100.00%> (ø) | |
| ... and 5 more | | |
@AdamGleave: replying to your comments here together:
> I'd prefer wrapping it with a NormalizedRewardNet; they're conceptually doing very different things, and we might want to use different normalization schemes (RunningNorm often works worse than EMANorm).
Ok, it required a larger refactor, but you can see how it looks in the last couple of commits.
Conveniently, this change also addresses your other comment. It simplifies the entropy reward classes (the entropy reward and the switch away from the pre-training reward are now separate) and allows for more configurability, at the expense of slightly more complicated wiring (in train_preference_comparisons.py).
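For readers following along, here is a rough sketch of what wrapping the pre-training reward with a NormalizedRewardNet can look like. Treat the exact names used here (BasicRewardNet as a stand-in base net, normalize_output_layer, RunningNorm, EMANorm) as assumptions for illustration; the real wiring is in train_preference_comparisons.py in this PR.

```python
import gym

from imitation.rewards.reward_nets import BasicRewardNet, NormalizedRewardNet
from imitation.util import networks

obs_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,))
act_space = gym.spaces.Discrete(2)

# Stand-in for whatever RewardNet produces the raw (entropy) pre-training reward.
base_reward_net = BasicRewardNet(obs_space, act_space)

# Normalization lives in a separate wrapper rather than inside the entropy
# reward itself, so the normalization scheme stays swappable.
normalized_reward_net = NormalizedRewardNet(
    base_reward_net,
    normalize_output_layer=networks.RunningNorm,  # or e.g. networks.EMANorm
)
```

Keeping normalization in the wrapper means the entropy reward class does one thing, and choosing between normalization schemes becomes a configuration decision rather than a code change.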
It also results in two internal behavior changes:
- Previously, the running mean/variance statistics for normalization were updated first and normalization was applied afterwards; now the order is reversed (see the sketch below).
- Previously, reward calculation required a numpy -> torch -> numpy conversion; now it internally converts numpy -> torch -> numpy -> torch -> numpy (because that is what the existing NormalizedRewardNet code does). This only applies during pre-training.
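A tiny illustration of the ordering change in the first bullet; the RunningStats class here is a generic running-statistics sketch (Welford's algorithm), not the actual RunningNorm/EMANorm implementation.

```python
import numpy as np


class RunningStats:
    """Minimal running mean/variance tracker, for illustration only."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def update(self, x: np.ndarray) -> None:
        for value in np.asarray(x, dtype=float).ravel():
            self.count += 1
            delta = value - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (value - self.mean)

    def normalize(self, x: np.ndarray) -> np.ndarray:
        std = np.sqrt(self.m2 / max(self.count, 1)) + 1e-8
        return (np.asarray(x, dtype=float) - self.mean) / std


prev_rewards = np.array([0.5, 1.5, 2.5])
new_rewards = np.array([1.0, 2.0, 3.0])

# Old order: fold the new batch into the statistics first, then normalize it.
old = RunningStats()
old.update(prev_rewards)
old.update(new_rewards)
normalized_old = old.normalize(new_rewards)

# New order (after this PR): normalize using statistics from earlier batches
# only, then update the statistics for future calls.
new = RunningStats()
new.update(prev_rewards)
normalized_new = new.normalize(new_rewards)
new.update(new_rewards)
```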