
Building blocks for PEBBLE

Open dan-pandori opened this issue 3 years ago • 3 comments

Description

Creates an entropy-reward replay wrapper to support the unsupervised, state-entropy-based pre-training of an agent, as described in the PEBBLE paper: https://sites.google.com/view/icml21pebble
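For anyone unfamiliar with PEBBLE's pre-training phase, the core idea is a particle-based state-entropy estimate: the intrinsic reward of a state is based on the distance to its k-th nearest neighbor among states sampled from the replay buffer. A minimal sketch (the function name and the `+ 1.0` smoothing are illustrative, not necessarily what this PR implements):

```python
import numpy as np
import torch


def knn_state_entropy_reward(
    states: torch.Tensor, replay_sample: torch.Tensor, k: int = 5
) -> torch.Tensor:
    """Particle-based state-entropy intrinsic reward (PEBBLE-style sketch).

    The reward for each state grows with the distance to its k-th nearest
    neighbor among replay-buffer states; larger distances mean the state
    lies in a sparser, less-visited region of the state space.
    """
    # Pairwise Euclidean distances between batch states and replay states.
    dists = torch.cdist(states, replay_sample)  # shape: (batch, replay)
    # k-th smallest distance per state.
    knn_dists = torch.kthvalue(dists, k=min(k, dists.shape[1]), dim=1).values
    return torch.log(knn_dists + 1.0)


# Hypothetical usage with flattened observations sampled from a replay buffer:
obs = torch.as_tensor(np.random.randn(32, 4), dtype=torch.float32)
replay_obs = torch.as_tensor(np.random.randn(1024, 4), dtype=torch.float32)
intrinsic_rewards = knn_state_entropy_reward(obs, replay_obs, k=5)
```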

Testing

Added unit tests.

dan-pandori avatar Nov 11 '22 19:11 dan-pandori

Thanks for the implementations!

yawen-d avatar Nov 14 '22 15:11 yawen-d

Codecov Report

Merging #625 (b344cbd) into master (1dd4c8f) will increase coverage by 0.10%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #625      +/-   ##
==========================================
+ Coverage   97.51%   97.62%   +0.10%     
==========================================
  Files          85       88       +3     
  Lines        8316     8698     +382     
==========================================
+ Hits         8109     8491     +382     
  Misses        207      207              
Impacted Files Coverage Δ
src/imitation/policies/base.py 100.00% <ø> (ø)
src/imitation/algorithms/pebble/entropy_reward.py 100.00% <100.00%> (ø)
src/imitation/algorithms/preference_comparisons.py 99.18% <100.00%> (+0.04%) :arrow_up:
src/imitation/policies/replay_buffer_wrapper.py 100.00% <100.00%> (ø)
src/imitation/scripts/common/rl.py 97.40% <100.00%> (-0.04%) :arrow_down:
...ion/scripts/config/train_preference_comparisons.py 88.29% <100.00%> (+2.96%) :arrow_up:
.../imitation/scripts/train_preference_comparisons.py 97.87% <100.00%> (+0.99%) :arrow_up:
src/imitation/util/networks.py 97.08% <100.00%> (+0.04%) :arrow_up:
src/imitation/util/util.py 99.13% <100.00%> (+0.14%) :arrow_up:
tests/algorithms/pebble/test_entropy_reward.py 100.00% <100.00%> (ø)
... and 5 more


codecov[bot] avatar Dec 02 '22 09:12 codecov[bot]

@AdamGleave: reacting to your comments here together:

> I'd prefer wrapping it with a NormalizedRewardNet; they're conceptually doing very different things, and we might want to use different normalization schemes (RunningNorm often works worse than EMANorm).

Ok, it required a larger refactor, but you can see how it looks in the last couple of commits.
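For readers following along, this is roughly what the wrapping looks like against imitation's reward-net API as I understand it; the exact constructor arguments, normalization class, and the wiring done in this PR's train_preference_comparisons.py may differ:

```python
import numpy as np
from gym import spaces

from imitation.rewards.reward_nets import BasicRewardNet, NormalizedRewardNet
from imitation.util import networks

obs_space = spaces.Box(low=-1.0, high=1.0, shape=(4,))
act_space = spaces.Box(low=-1.0, high=1.0, shape=(2,))

# The "raw" reward (a learned preference reward, or the entropy reward during
# pre-training) stays its own object...
base_reward_net = BasicRewardNet(obs_space, act_space)

# ...and normalization is layered on top, with the scheme swappable
# (e.g. EMANorm instead of RunningNorm, as suggested in the review).
normalized_reward = NormalizedRewardNet(
    base_reward_net,
    normalize_output_layer=networks.RunningNorm,  # or networks.EMANorm
)

# predict_processed accepts numpy transitions and returns normalized numpy
# rewards, updating the normalization statistics as a side effect.
batch = 8
rewards = normalized_reward.predict_processed(
    state=np.zeros((batch, 4), dtype=np.float32),
    action=np.zeros((batch, 2), dtype=np.float32),
    next_state=np.zeros((batch, 4), dtype=np.float32),
    done=np.zeros((batch,), dtype=bool),
)
```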

A nice side effect is that this change also addresses your other comment: it simplified the entropy reward classes (the entropy reward itself and the switch away from the pre-training reward are now separate) and allows for more configurability, at the expense of making the wiring a little more complicated (in train_preference_comparisons.py).

It also results in two changes internally:

  • Previously, the running mean/var statistics for normalization were updated first and then normalization was applied; now the order is swapped (see the toy sketch below).
  • Previously, the reward calculation required numpy -> torch -> numpy conversions; now it internally converts numpy -> torch -> numpy -> torch -> numpy, because that's what the existing NormalizedRewardNet code does. This only applies during pre-training, though.
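To make the first point concrete, here is a toy normalizer (not imitation's RunningNorm/EMANorm, purely illustrative) showing how the two orderings differ for a given batch:

```python
import numpy as np


class ToyNorm:
    """Toy mean/std normalizer, only to illustrate the ordering change."""

    def __init__(self) -> None:
        self.seen: list = []
        self.mean, self.std = 0.0, 1.0  # initial ("stale") statistics

    def update(self, batch: np.ndarray) -> None:
        # Recompute statistics over everything seen so far.
        self.seen.extend(batch.tolist())
        self.mean = float(np.mean(self.seen))
        self.std = float(np.std(self.seen)) + 1e-8

    def normalize(self, batch: np.ndarray) -> np.ndarray:
        return (batch - self.mean) / self.std


rewards = np.array([1.0, 2.0, 3.0])

# Previous behaviour: update the statistics first, then normalize, so the
# very first batch is normalized with stats computed from itself.
old = ToyNorm()
old.update(rewards)
print(old.normalize(rewards))  # roughly zero-centered

# New behaviour: normalize with the existing (stale) stats, then update them.
new = ToyNorm()
print(new.normalize(rewards))  # uses the initial mean=0, std=1
new.update(rewards)
```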

feynmanix avatar Dec 10 '22 21:12 feynmanix