
Preference-based reinforcement learning: applying a "recurrent reward network" to solve a POMDP problem

Open CAI23sbP opened this issue 10 months ago • 0 comments

Problem

This issue concerns preference-based reinforcement learning on a POMDP problem. In the paper, the author states that the reward model can use a recurrent neural network to solve POMDP problems.

Solution

I added a GRU to solve the POMDP problem. Please see my repo. My main ideas:

  1. BufferingWrapper and RewardVecEnvWrapper must be merged so that the hidden_state is saved together with the observation, action, etc.
  2. To ensemble the recurrent reward network, I generate as many hidden_states as ensemble_size, one per ensemble member.
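
The two ideas above can be illustrated with a minimal, dependency-free sketch. It is not the code from the repo and does not use the imitation API: `TinyGRUReward` and `RecurrentRewardBuffer` are hypothetical names, the GRU is a bias-free toy cell with random weights, and observations and actions are assumed to be flat lists of floats. It only shows how each ensemble member can keep its own hidden state, and how that state can be stored next to the observation and action in one place.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyGRUReward:
    """Hypothetical toy GRU reward member (illustrative, untrained)."""

    def __init__(self, in_dim, hid_dim, seed):
        rng = random.Random(seed)
        cols = in_dim + hid_dim

        def mat():
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(hid_dim)]

        self.hid_dim = hid_dim
        self.Wz, self.Wr, self.Wh = mat(), mat(), mat()
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(hid_dim)]

    def initial_state(self):
        return [0.0] * self.hid_dim

    def step(self, x, h):
        xh = x + h  # concatenate input and previous hidden state
        z = [sigmoid(dot(row, xh)) for row in self.Wz]  # update gate
        r = [sigmoid(dot(row, xh)) for row in self.Wr]  # reset gate
        xrh = x + [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(dot(row, xrh)) for row in self.Wh]
        h_new = [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]
        return dot(self.w_out, h_new), h_new


class RecurrentRewardBuffer:
    """Hypothetical merged buffer + reward step: one hidden state per member."""

    def __init__(self, members):
        self.members = members
        self.hidden = [m.initial_state() for m in members]
        self.transitions = []

    def step(self, obs, act, done):
        x = obs + act
        # Save the hidden states that were used as input for this step.
        h_before = [list(h) for h in self.hidden]
        rewards = []
        for i, member in enumerate(self.members):
            reward, self.hidden[i] = member.step(x, self.hidden[i])
            rewards.append(reward)
        self.transitions.append(
            {"obs": obs, "act": act, "hidden": h_before, "done": done}
        )
        if done:  # start the next episode from fresh hidden states
            self.hidden = [m.initial_state() for m in self.members]
        return sum(rewards) / len(rewards)  # ensemble mean reward


ensemble_size = 3
members = [TinyGRUReward(in_dim=3, hid_dim=4, seed=s) for s in range(ensemble_size)]
buffer = RecurrentRewardBuffer(members)
mean_reward = buffer.step([0.1, -0.2], [0.3], done=False)
```

Storing the pre-step hidden states with each transition means a fragment sampled later for preference comparison can be replayed from the right recurrent state for every ensemble member, which is the reason the sketch keeps buffering and reward computation in a single class.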

Result

I applied this to the BipedalWalker-v3 environment with AbsorbAfterDoneWrapper from your sister project, seals.

[image]

Addition

I also added dict_preference.py to support dict-type observation spaces.

CAI23sbP, Apr 24 '24 10:04