dreamerv2
Why does the sequence of rewards start at t-1?
Thanks for sharing the code, but I have a question. In buffer.py, here:
def _shift_sequences(self, obs, actions, rewards, terminals):
    obs = obs[1:]
    actions = actions[:-1]
    rewards = rewards[:-1]
    terminals = terminals[:-1]
    return obs, actions, rewards, terminals
I think you want to align states with rewards, but in trainer.py, here:
obs, actions, rewards, terms = self.buffer.sample()
obs = torch.tensor(obs, dtype=torch.float32).to(self.device) # t, t+seq_len
actions = torch.tensor(actions, dtype=torch.float32).to(self.device) # t-1, t+seq_len-1
rewards = torch.tensor(rewards, dtype=torch.float32).to(self.device).unsqueeze(-1) # t-1 to t+seq_len-1
nonterms = torch.tensor(1-terms, dtype=torch.float32).to(self.device).unsqueeze(-1) # t-1 to t+seq_len-1
Why does the sequence of rewards start at t-1?
When prefilling the buffer, a transition (s_t, a_t, r_t+1, d_t+1) is pushed into the buffer, but r_t+1 corresponds to s_t+1. So after calling _shift_sequences, the states and the rewards should already be aligned, and I think the rewards should start at t rather than t-1.
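For what it's worth, here is a minimal sketch of the index bookkeeping, assuming (as described above) that the buffer stores (s_t, a_t, r_t+1, d_t+1) at index t. The integer values below simply encode the time index of each element:

import numpy as np

obs       = np.array([0, 1, 2, 3, 4])   # s_0 ... s_4
actions   = np.array([0, 1, 2, 3, 4])   # a_0 ... a_4
rewards   = np.array([1, 2, 3, 4, 5])   # r_1 ... r_5 (r_t+1 stored at index t)
terminals = np.array([1, 2, 3, 4, 5])   # d_1 ... d_5

# same slicing as _shift_sequences in buffer.py
obs, actions, rewards, terminals = obs[1:], actions[:-1], rewards[:-1], terminals[:-1]

print(obs)        # [1 2 3 4] -> s_1 ... s_4
print(actions)    # [0 1 2 3] -> a_0 ... a_3
print(rewards)    # [1 2 3 4] -> r_1 ... r_4
print(terminals)  # [1 2 3 4] -> d_1 ... d_4

Under that storage assumption, after the shift obs[i], actions[i] and rewards[i] correspond to (s_i+1, a_i, r_i+1), i.e. rewards[i] is the reward received upon reaching obs[i]. Whether the comments label the rewards as starting at "t" or "t-1" then seems to come down to which element the index t is counted from.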
Hi, in relation to this problem, I found that the env doesn't get an action at the very first iteration.
In the training loop:
prev_action = torch.zeros(1, trainer.action_size).to(trainer.device) # initialize
...
next_obs, rew, done, _ = env.step(action.squeeze(0).cpu().numpy()) # the first loop
For the MinAtar environment, the initialized action=0 might mean doing nothing, so this issue might not have an effect on results, but when applying this to another environment, it may cause problems. I'm not sure, so please give me your advice. Thank you!
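I'm not sure what the intended behaviour is, but as a hypothetical workaround (assuming discrete one-hot actions of size trainer.action_size), the very first prev_action could be initialized as a one-hot of a randomly sampled valid action instead of the all-zeros vector, for example:

import torch

action_size = 6  # placeholder; in the repo this would be trainer.action_size
rand_idx = torch.randint(0, action_size, (1,)).item()  # sample a valid action index
prev_action = torch.zeros(1, action_size)              # same shape as in the training loop (.to(device) omitted)
prev_action[0, rand_idx] = 1.0                         # one-hot encoding of the sampled action

Whether this actually matters in practice probably depends on how the environment and the action encoding treat the zero vector.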
I do not quite understand the _shift_sequences function here. Why should we shift the transition sequences?