dreamerv2
Why does the sequence of rewards start at t-1?
Thanks for sharing the code, but I have a question. In buffer.py, here:
def _shift_sequences(self, obs, actions, rewards, terminals):
    obs = obs[1:]
    actions = actions[:-1]
    rewards = rewards[:-1]
    terminals = terminals[:-1]
    return obs, actions, rewards, terminals
I think you want to align states with rewards, but in trainer.py, here:
obs, actions, rewards, terms = self.buffer.sample()
obs = torch.tensor(obs, dtype=torch.float32).to(self.device) # t, t+seq_len
actions = torch.tensor(actions, dtype=torch.float32).to(self.device) # t-1, t+seq_len-1
rewards = torch.tensor(rewards, dtype=torch.float32).to(self.device).unsqueeze(-1) # t-1 to t+seq_len-1
nonterms = torch.tensor(1-terms, dtype=torch.float32).to(self.device).unsqueeze(-1) # t-1 to t+seq_len-1
Why does the sequence of rewards start at t-1?
When prefilling the buffer, a transition (s_t, a_t, r_t+1, d_t+1) is pushed into the buffer, but r_t+1 corresponds to s_t+1. So after calling _shift_sequences, the states and the rewards should already be aligned, and I think the rewards should start at t rather than t-1.
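For what it's worth, here is a minimal sketch of the index bookkeeping, assuming (as described above) that the buffer stores (s_t, a_t, r_t+1, d_t+1) at index t. The integer values below simply encode the time index of each element:

import numpy as np

obs       = np.array([0, 1, 2, 3, 4])   # s_0 ... s_4
actions   = np.array([0, 1, 2, 3, 4])   # a_0 ... a_4
rewards   = np.array([1, 2, 3, 4, 5])   # r_1 ... r_5 (r_t+1 stored at index t)
terminals = np.array([1, 2, 3, 4, 5])   # d_1 ... d_5

# same slicing as _shift_sequences in buffer.py
obs, actions, rewards, terminals = obs[1:], actions[:-1], rewards[:-1], terminals[:-1]

print(obs)        # [1 2 3 4] -> s_1 ... s_4
print(actions)    # [0 1 2 3] -> a_0 ... a_3
print(rewards)    # [1 2 3 4] -> r_1 ... r_4
print(terminals)  # [1 2 3 4] -> d_1 ... d_4

Under that storage assumption, after the shift obs[i], actions[i] and rewards[i] correspond to (s_i+1, a_i, r_i+1), i.e. rewards[i] is the reward received upon reaching obs[i]. Whether the comments label the rewards as starting at "t" or "t-1" then seems to come down to which element the index t is counted from.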
Hi, in relation to this problem, I found that the env doesn't get an action at the very first iteration.
In the training loop:
prev_action = torch.zeros(1, trainer.action_size).to(trainer.device) # initialize
...
next_obs, rew, done, _ = env.step(action.squeeze(0).cpu().numpy()) # the first loop
For the MinAtar environment, the initialized action=0 might mean doing nothing, so this issue might not have an effect on results, but when applying this to another environment, it may cause problems. I'm not sure, so please give me your advice. Thank you!
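I'm not sure what the intended behaviour is, but as a hypothetical workaround (assuming discrete one-hot actions of size trainer.action_size), the very first prev_action could be initialized as a one-hot of a randomly sampled valid action instead of the all-zeros vector, for example:

import torch

action_size = 6  # placeholder; in the repo this would be trainer.action_size
rand_idx = torch.randint(0, action_size, (1,)).item()  # sample a valid action index
prev_action = torch.zeros(1, action_size)              # same shape as in the training loop (.to(device) omitted)
prev_action[0, rand_idx] = 1.0                         # one-hot encoding of the sampled action

Whether this actually matters in practice probably depends on how the environment and the action encoding treat the zero vector.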
I do not quite understand the _shift_sequences function here. Why should we shift the transition sequences?