dreamer-pytorch
Why are observation_embed and action at the same “t” in the rollout_representation function?
Hi, I'm confused... In rnns.py, there is a function as follows:
def rollout_representation(self, steps: int, obs_embed: torch.Tensor, action: torch.Tensor,
                           prev_state: RSSMState):
    priors = []
    posteriors = []
    for t in range(steps):
        # one RSSM step: both states are produced from obs_embed[t], action[t], and prev_state
        prior_state, posterior_state = self.representation_model(obs_embed[t], action[t], prev_state)
        priors.append(prior_state)
        posteriors.append(posterior_state)
        # carry the posterior forward as the previous state for the next step
        prev_state = posterior_state
    prior = stack_states(priors, dim=0)
    post = stack_states(posteriors, dim=0)
    return prior, post
According to the formula in the paper, shouldn't the inputs of the representation model be the action from the previous timestep and the obs_embed from the current timestep? So why are both taken at the same timestep here: prior_state, posterior_state = self.representation_model(obs_embed[t], action[t], prev_state)? Maybe I missed some detail; please help me resolve my confusion. Thank you.
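For reference (writing this from memory, so the notation may differ slightly from the paper), the latent dynamics models in Dreamer are
$q(s_t | s_{t-1}, a_{t-1})$ (transition model, the prior)
$p(s_t | s_{t-1}, a_{t-1}, o_t)$ (representation model, the posterior)
i.e. both condition on the previous action $a_{t-1}$, and only the posterior additionally sees the current observation $o_t$.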
It is already taken into account here: https://github.com/juliusfrost/dreamer-pytorch/blob/47bd509ab5cffa95ec613fd788d7ae1fe664ecd5/dreamer/algos/dreamer_algo.py#L194-L196
Hi, shouldn't it be
observation = samples.all_observation[:-1] # [t, t+batch_length+1] -> [t, t+batch_length]
action = samples.all_action[:-1] # [t-1, t+batch_length] -> [t-1, t+batch_length-1]
reward = samples.all_reward[1:] # [t-1, t+batch_length] -> [t, t+batch_length]
so that
self.representation_model(obs_embed[t], action[t], prev_state)
will be $p(s_t | s_{t-1}, a_{t-1})$ for the prior and $p(s_t | s_{t-1}, a_{t-1}, o_t)$ for the posterior.
The current code computes $p(s_t | s_{t-1}, a_t)$ for the prior and $p(s_t | s_{t-1}, a_t, o_t)$ for the posterior. Did I miss something?
all_observation is the observation, not the state. Check the comments in those lines :)
Hi, sorry, maybe I was not clear. My question was about the indexing of the action. The code is
observation = samples.all_observation[:-1] # [t, t+batch_length+1] -> [t, t+batch_length]
action = samples.all_action[1:] # [t-1, t+batch_length] -> [t, t+batch_length]
reward = samples.all_reward[1:] # [t-1, t+batch_length] -> [t, t+batch_length]
reward = reward.unsqueeze(2)
done = samples.done
done = done.unsqueeze(2)
# Extract tensors from the Samples object
# They all have the batch_t dimension first, but we'll put the batch_b dimension first.
# Also, we convert all tensors to floats so they can be fed into our models.
lead_dim, batch_t, batch_b, img_shape = infer_leading_dims(observation, 3)
# squeeze batch sizes to single batch dimension for imagination roll-out
batch_size = batch_t * batch_b
# normalize image
observation = observation.type(self.type) / 255.0 - 0.5
# embed the image
embed = model.observation_encoder(observation)
prev_state = model.representation.initial_state(batch_b, device=action.device, dtype=action.dtype)
# Rollout model by taking the same series of actions as the real model
prior, post = model.rollout.rollout_representation(batch_t, embed, action, prev_state)
which means embed is $o_{t:t+K}$ and action is $a_{t:t+K}$ (judging by the comments in the code). Don't we need $a_{t-1:t+K-1}$ instead?
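To make what I mean concrete, here is a tiny sketch (not code from the repo; the labels simply follow the comments in dreamer_algo.py, and L stands in for batch_length) that only tracks timestep labels through the same slicing:
L = 3
all_observation = [f"o_{k}" for k in range(0, L + 2)]   # per the comment: [t, t+L+1], with t = 0 here
all_action = [f"a_{k}" for k in range(-1, L + 1)]       # per the comment: [t-1, t+L]

observation = all_observation[:-1]   # ['o_0', 'o_1', 'o_2', 'o_3']  i.e. [t, t+L]
action = all_action[1:]              # ['a_0', 'a_1', 'a_2', 'a_3']  i.e. [t, t+L]

for t in range(len(observation)):
    # under these labels, obs_embed[t] and action[t] carry the same timestep
    print(observation[t], action[t])
Under those labels, the pair fed to representation_model at step t is $(o_t, a_t)$, which is where my confusion comes from.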
No, action is $a_{t-1 : t+K-1}$. The observation sequence timesteps are like [t, t+batch_length+1], taken with [:-1], and the action sequence timesteps are like [t-1, t+batch_length], taken with [1:].
Because the comment for observation is
# [t, t+batch_length+1] -> [t, t+batch_length]
and for action is
# [t-1, t+batch_length] -> [t, t+batch_length]
So that's why I thought it was wrong: according to those comments, both are $o_{t:t+K}$ and $a_{t:t+K}$. I said I was not entirely sure because I also don't know the details of the replay buffer sampling. Thanks for the confirmation!
Okay, all of the code is fine. Depending on where you cut the array, you can treat the data as starting from t-1 or as starting from t. Thanks.
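To spell that last point out with a toy example (illustrative labels only, not the actual rlpyt buffer layout):
stored_actions = ["A", "B", "C", "D", "E"]   # one sampled window of actions from the buffer

# the same stored window can be labeled as starting at t-1 or at t;
# the label is just a choice of origin, the data does not change
labels_starting_at_t_minus_1 = [f"a_{k}" for k in range(-1, 4)]   # a_-1 .. a_3
labels_starting_at_t = [f"a_{k}" for k in range(0, 5)]            # a_0 .. a_4

print(list(zip(labels_starting_at_t_minus_1, stored_actions)))
print(list(zip(labels_starting_at_t, stored_actions)))
Both printouts describe exactly the same data; only the choice of which element is called timestep t changes, which is, as I read it, the point of "depending on where you cut the array".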