
Why are observation_embed and action at the same “t” in the rollout_representation function?

Open thj926 opened this issue 3 years ago • 7 comments

Hi, I'm confused... In rnns.py, there is a function as follows:

def rollout_representation(self, steps: int, obs_embed: torch.Tensor, action: torch.Tensor,
                           prev_state: RSSMState):
    priors = []
    posteriors = []
    for t in range(steps):
        prior_state, posterior_state = self.representation_model(obs_embed[t], action[t], prev_state)
        priors.append(prior_state)
        posteriors.append(posterior_state)
        prev_state = posterior_state
    prior = stack_states(priors, dim=0)
    post = stack_states(posteriors, dim=0)
    return prior, post

According to the original formulation in the paper, the inputs to the representation model should be the action from the previous timestep and the obs_embed from the current timestep, right? So why do both use the same timestep here, in prior_state, posterior_state = self.representation_model(obs_embed[t], action[t], prev_state)? Maybe I missed some details; please help me resolve my confusion. Thank you.
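
For concreteness, here is roughly what I expected the loop to look like (just a sketch of my reading of the paper, not the actual repo code; prev_action is a name I made up for an action sequence shifted by one step, so that prev_action[t] holds $a_{t-1}$):

    for t in range(steps):
        # prior p(s_t | s_{t-1}, a_{t-1}) and posterior p(s_t | s_{t-1}, a_{t-1}, o_t):
        # condition on the current observation embedding and the *previous* action
        prior_state, posterior_state = self.representation_model(obs_embed[t], prev_action[t], prev_state)
        prev_state = posterior_state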

thj926 avatar Aug 25 '21 13:08 thj926

It is already handled here: https://github.com/juliusfrost/dreamer-pytorch/blob/47bd509ab5cffa95ec613fd788d7ae1fe664ecd5/dreamer/algos/dreamer_algo.py#L194-L196

seolhokim avatar Feb 13 '23 04:02 seolhokim

Hi, shouldn't it be

observation = samples.all_observation[:-1]  # [t, t+batch_length+1] -> [t, t+batch_length] 
action = samples.all_action[:-1]            # [t-1, t+batch_length] -> [t-1, t+batch_length-1] 
reward = samples.all_reward[1:]             # [t-1, t+batch_length] -> [t, t+batch_length] 

so that

self.representation_model(obs_embed[t], action[t], prev_state)

will be $p(s_t | s_{t-1}, a_{t-1})$ for the prior and $p(s_t | s_{t-1}, a_{t-1}, o_t)$ for the posterior.

The current code computes $p(s_t | s_{t-1}, a_t)$ for the prior and $p(s_t | s_{t-1}, a_t, o_t)$ for the posterior. Did I miss something?

gunnxx avatar Oct 04 '23 04:10 gunnxx

all_observation is the observation, not the state. Check the comments on those lines :)

seolhokim avatar Oct 10 '23 03:10 seolhokim

Hi, sorry, maybe I was not clear. My question was about the indexing of the action. The code is

        observation = samples.all_observation[:-1]  # [t, t+batch_length+1] -> [t, t+batch_length]
        action = samples.all_action[1:]  # [t-1, t+batch_length] -> [t, t+batch_length]
        reward = samples.all_reward[1:]  # [t-1, t+batch_length] -> [t, t+batch_length]
        reward = reward.unsqueeze(2)
        done = samples.done
        done = done.unsqueeze(2)

        # Extract tensors from the Samples object
        # They all have the batch_t dimension first, but we'll put the batch_b dimension first.
        # Also, we convert all tensors to floats so they can be fed into our models.

        lead_dim, batch_t, batch_b, img_shape = infer_leading_dims(observation, 3)
        # squeeze batch sizes to single batch dimension for imagination roll-out
        batch_size = batch_t * batch_b

        # normalize image
        observation = observation.type(self.type) / 255.0 - 0.5
        # embed the image
        embed = model.observation_encoder(observation)

        prev_state = model.representation.initial_state(batch_b, device=action.device, dtype=action.dtype)
        # Rollout model by taking the same series of actions as the real model
        prior, post = model.rollout.rollout_representation(batch_t, embed, action, prev_state)

which means embed is $o_{t:t+K}$ and action is $a_{t:t+K}$ (judging by the comments in the code; I spell out the indexing I have in mind below). Don't we need $a_{t-1:t+K-1}$ instead?
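
To spell out the indexing I have in mind (taking the ranges in the code comments at face value, with K = batch_length; this is just my reading):

    buffer index i :   0        1        2        ...   K+1
    all_observation:   o_t      o_{t+1}  o_{t+2}  ...   o_{t+K+1}
    all_action     :   a_{t-1}  a_t      a_{t+1}  ...   a_{t+K}

    observation = all_observation[:-1]   ->  o_t, o_{t+1}, ..., o_{t+K}
    action      = all_action[1:]         ->  a_t, a_{t+1}, ..., a_{t+K}

so after the slicing, observation[i] and action[i] sit at the same timestep t+i.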

gunnxx avatar Oct 10 '23 03:10 gunnxx

No, action is $a_{t-1:t+K-1}$. The raw observation sequence covers timesteps [t, t+batch_length+1] and is cut with [:-1], while the raw action sequence covers [t-1, t+batch_length] and is cut with [1:].

seolhokim avatar Oct 10 '23 04:10 seolhokim

The comment for observation is

# [t, t+batch_length+1] -> [t, t+batch_length]

and for action is

# [t-1, t+batch_length] -> [t, t+batch_length]

So that's why I thought it was wrong: both end up as $o_{t:t+K}$ and $a_{t:t+K}$. I was not entirely sure because I was also unsure about the replay buffer sampling. Thanks for the confirmation!

gunnxx avatar Oct 10 '23 04:10 gunnxx

Okay, all of the code is fine. Depending on where you cut the array, you can create data starting from t-1 or data starting from t. Thanks.
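
To illustrate what I mean by "where you cut the array" (a toy example with made-up arrays, not the repo's buffer; which slicing gives you $(o_t, a_{t-1})$ pairs depends on how the sampler aligns actions with observations when the buffer is filled):

    # Convention 1: slot i stores (o_i, a_i), the action chosen *after* seeing o_i.
    obs = ["o0", "o1", "o2", "o3"]
    act = ["a0", "a1", "a2", "a3"]
    print(list(zip(obs[1:], act[:-1])))  # [('o1', 'a0'), ('o2', 'a1'), ('o3', 'a2')]: each obs with the previous action

    # Convention 2: slot i stores (o_i, a_{i-1}), the action that *led to* o_i.
    obs = ["o0", "o1", "o2", "o3"]
    act = ["a-1", "a0", "a1", "a2"]
    print(list(zip(obs, act)))  # [('o0', 'a-1'), ('o1', 'a0'), ...]: already (o_t, a_{t-1}), no shift needed

    # So whether a one-step shift is needed, and what you call "t", depends on that fill convention.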

seolhokim avatar Oct 10 '23 05:10 seolhokim