imitation
Preference-based reinforcement learning: apply a "recurrent reward network" to solve POMDP problems
Problem
This concerns preference-based reinforcement learning on a POMDP problem. In the paper, the authors say that the reward model can use a recurrent neural network to handle POMDP problems.
Solution
I added a GRU to the reward model to handle the POMDP problem. Please see my repo. My main ideas:
- BufferingWrapper and RewardVecEnvWrapper must be merged so that hidden_state is saved together with observation, action, etc.
- To apply recurrent reward network ensembling, I generate hidden_states, one per ensemble member, so their number equals ensemble_size.
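The two ideas above can be sketched roughly as follows. This is a minimal, pure-Python illustration and not the code from the repo: `TinyGRUReward`, `RecurrentRewardEnsemble`, and `Transition` are hypothetical names, the GRU cell is a bias-free stand-in for what would normally be a PyTorch GRU, and averaging the member rewards is an assumption. The point is only the bookkeeping: one hidden state per ensemble member, stored in the buffer next to the observation and action that used it.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Tuple


def _dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class TinyGRUReward:
    """Hypothetical stand-in for one recurrent reward network (bias-free GRU cell)."""

    def __init__(self, input_size: int, hidden_size: int, seed: int) -> None:
        rng = random.Random(seed)
        n = input_size + hidden_size
        self.hidden_size = hidden_size
        # Update gate, reset gate, and candidate weights, then a linear reward head.
        self.w_z, self.w_r, self.w_h = (
            [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(hidden_size)]
            for _ in range(3)
        )
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]

    def initial_state(self) -> List[float]:
        return [0.0] * self.hidden_size

    def step(self, x: List[float], h: List[float]) -> Tuple[float, List[float]]:
        z = [_sigmoid(_dot(row, x + h)) for row in self.w_z]
        r = [_sigmoid(_dot(row, x + h)) for row in self.w_r]
        gated = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(_dot(row, x + gated)) for row in self.w_h]
        new_h = [(1 - zi) * ci + zi * hi for zi, ci, hi in zip(z, cand, h)]
        return _dot(self.w_out, new_h), new_h


@dataclass
class Transition:
    """Buffer entry: hidden states are stored together with obs and action."""
    obs: List[float]
    act: List[float]
    hidden_states: List[List[float]]  # one entry per ensemble member
    reward: float


class RecurrentRewardEnsemble:
    def __init__(self, input_size: int, hidden_size: int, ensemble_size: int) -> None:
        self.members = [
            TinyGRUReward(input_size, hidden_size, seed=i) for i in range(ensemble_size)
        ]
        self.reset()

    def reset(self) -> None:
        # Call at episode boundaries: as many hidden states as ensemble members.
        self.hidden_states = [m.initial_state() for m in self.members]

    def step(self, obs: List[float], act: List[float]) -> Transition:
        x = list(obs) + list(act)
        used = [list(h) for h in self.hidden_states]  # states consumed by this step
        rewards = []
        for i, member in enumerate(self.members):
            reward, self.hidden_states[i] = member.step(x, self.hidden_states[i])
            rewards.append(reward)
        return Transition(list(obs), list(act), used, sum(rewards) / len(rewards))


ensemble = RecurrentRewardEnsemble(input_size=3, hidden_size=4, ensemble_size=3)
trajectory = [([0.1, -0.2], [1.0]), ([0.3, 0.0], [0.0]), ([-0.4, 0.5], [1.0])]
buffer = [ensemble.step(obs, act) for obs, act in trajectory]
```

Because each buffered transition carries the hidden states that were fed into it, a stored step can be re-evaluated later (for example when training on preference fragments) without replaying the episode from its start.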
Result
I applied this to the BipedalWalker-v3 env with AbsorbAfterDoneWrapper from your sister project, seals.
Addition
I added dict_preference.py to support dict-type observation spaces.
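The contents of dict_preference.py are not shown here. As a rough illustration of one way a reward model could consume dict observations, the sketch below flattens a dict into a single vector with a stable key order. The helper `flatten_dict_obs` and the example keys are hypothetical and are not taken from the repo.

```python
from typing import Dict, List, Sequence


def flatten_dict_obs(obs: Dict[str, Sequence[float]]) -> List[float]:
    """Concatenate dict entries in sorted key order so the layout is stable."""
    flat: List[float] = []
    for key in sorted(obs):
        flat.extend(float(v) for v in obs[key])
    return flat


# Hypothetical dict observation with three named components.
example_obs = {"position": [0.5, -1.0], "velocity": [0.1], "contact": [1, 0]}
flat_obs = flatten_dict_obs(example_obs)
```

Sorting the keys keeps the vector layout identical across steps, which matters when the flattened observation is fed to a network with fixed input size.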