imitation Preference based Reinforcement Learning applies a "recurrent reward network" for solving a POMDP problem

Open CAI23sbP opened this issue 10 months ago • 0 comments

Problem

This issue concerns preference-based reinforcement learning on a POMDP problem. In the paper, the authors state that the reward model can use a recurrent neural network to handle the POMDP setting.

Solution

I added a GRU to solve the POMDP problem. Please see my repo. My main ideas:

BufferingWrapper and RewardVecEnvWrapper must be merged so that hidden_state is saved together with the observation, action, etc.
To ensemble the recurrent reward network, I generated as many hidden_states as ensemble_size, one per ensemble member.
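
The two ideas above can be illustrated with a minimal PyTorch sketch. This is not the code from the repo and not part of the imitation API: `GRURewardNet`, `RecurrentRewardEnsemble`, and all of their parameters are hypothetical names chosen for illustration. It assumes flat observation and action vectors, a float `done` mask, and simple averaging over ensemble members.

```python
import torch
from torch import nn


class GRURewardNet(nn.Module):
    """Hypothetical recurrent reward model: r_t depends on (obs_t, act_t, h_{t-1})."""

    def __init__(self, obs_dim: int, act_dim: int, hidden_dim: int = 32):
        super().__init__()
        self.gru = nn.GRUCell(obs_dim + act_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, obs, act, hidden):
        # obs: (n_envs, obs_dim), act: (n_envs, act_dim), hidden: (n_envs, hidden_dim)
        new_hidden = self.gru(torch.cat([obs, act], dim=-1), hidden)
        reward = self.head(new_hidden).squeeze(-1)  # (n_envs,)
        return reward, new_hidden


class RecurrentRewardEnsemble:
    """Hypothetical ensemble that keeps one hidden state per member."""

    def __init__(self, obs_dim, act_dim, ensemble_size=3, n_envs=1, hidden_dim=32):
        self.members = [
            GRURewardNet(obs_dim, act_dim, hidden_dim) for _ in range(ensemble_size)
        ]
        # The number of hidden states equals ensemble_size.
        self.hidden_states = [
            torch.zeros(n_envs, hidden_dim) for _ in range(ensemble_size)
        ]

    @torch.no_grad()
    def step(self, obs, act, done):
        # Snapshot the hidden states that were used for this step, so they can
        # be buffered together with the observation and action.
        hiddens_in = torch.stack(self.hidden_states)  # (ensemble_size, n_envs, hidden_dim)
        rewards = []
        for i, net in enumerate(self.members):
            reward, new_hidden = net(obs, act, self.hidden_states[i])
            # Zero the hidden state of every env whose episode just ended.
            self.hidden_states[i] = new_hidden * (1.0 - done).unsqueeze(-1)
            rewards.append(reward)
        return torch.stack(rewards).mean(dim=0), hiddens_in


# Usage with BipedalWalker-v3 shapes (24-dim observation, 4-dim action).
torch.manual_seed(0)
ensemble = RecurrentRewardEnsemble(obs_dim=24, act_dim=4, ensemble_size=3, n_envs=2)
obs, act = torch.randn(2, 24), torch.randn(2, 4)
done = torch.tensor([0.0, 1.0])  # second env finished its episode
reward, hiddens_in = ensemble.step(obs, act, done)
# A merged buffering/reward wrapper would store the hidden state with the transition.
transition = {"obs": obs, "act": act, "hidden": hiddens_in}
```

Storing the input hidden state with each transition lets the reward model be re-evaluated later on sampled trajectory fragments without replaying the whole episode; whether the repo does exactly this is an assumption.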

Result

I applied this to the BipedalWalker-v3 env with AbsorbAfterDoneWrapper from your sister project, seals.

Addition

I added dict_preference.py to support dict-type observation spaces.

Apr 24 '24 10:04 CAI23sbP
