
A question on the info loss


In this file: https://github.com/ermongroup/MetaIRL/blob/master/inverse_rl/models/info_airl_state_train.py#L141

Here, the info loss is computed as the negative log-likelihood of a sampled context "m" under q_\psi( \cdot | expert trajectory):


# Get "m" distribution by feeding expert trajectory
context_dist_info_vars = self.context_encoder.dist_info_sym(expert_traj_var)

# Sample a "m" from the distribution
context_mean_var = context_dist_info_vars["mean"]
context_log_std_var = context_dist_info_vars["log_std"]
eps = tf.random.normal(shape=tf.shape(context_mean_var))
reparam_latent = eps * tf.exp(context_log_std_var) + context_mean_var

# Compute the log probability of the sampled "m" in its own distribution
log_q_m_tau = tf.reshape(self.context_encoder.distribution.log_likelihood_sym(reparam_latent, context_dist_info_vars) ...

info_loss = - tf.reduce_mean(log_q_m_tau ...
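
If I read this correctly, in expectation the code computes

info_loss = - E_{ m ~ q_\psi( \cdot | expert trajectory) } [ log q_\psi( m | expert trajectory) ]

i.e. a single-sample (reparameterized) estimate of the entropy of q_\psi( \cdot | expert trajectory).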

However, this seems different from the equation shown in the original paper. In the paragraph below (screenshot attached), my understanding is that you do the following (a rough code sketch of this reading follows below):

  1. Use the expert trajectory to get an "m distribution": q_\psi( \cdot | expert trajectory)
  2. Use the "m distribution" to sample a context vector "m"
  3. Feed "m" to the policy and roll it out to get an agent trajectory "tau_agent"
  4. Feed the agent trajectory to the posterior to get an estimated "m'": q_\psi( m' | agent trajectory)
  5. Compute the log probability of the estimated "m'" under the original "m distribution" q_\psi( \cdot | expert trajectory)
[Screenshot: the corresponding paragraph and equation from the paper]
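
For concreteness, here is a rough sketch of what I think steps 1-5 would look like, written in the style of the existing code. Names like `policy_rollout_sym` and `agent_traj_var` are purely illustrative and do not exist in the repo; this is just my reading of the paper, not a suggested implementation.

```python
# 1. "m distribution" from the expert trajectory: q_psi( . | expert trajectory)
expert_dist_info = self.context_encoder.dist_info_sym(expert_traj_var)

# 2. Sample a context vector "m" via the reparameterization trick
eps = tf.random.normal(shape=tf.shape(expert_dist_info["mean"]))
m = eps * tf.exp(expert_dist_info["log_std"]) + expert_dist_info["mean"]

# 3. Roll out the m-conditioned policy to obtain an agent trajectory tau_agent
agent_traj_var = policy_rollout_sym(self.policy, m)  # illustrative, not in the repo

# 4. Feed tau_agent to the posterior: q_psi( . | agent trajectory)
agent_dist_info = self.context_encoder.dist_info_sym(agent_traj_var)
m_prime = agent_dist_info["mean"]  # the "estimated m'"

# 5. Log probability of the estimated m' under the expert-trajectory distribution
log_q = self.context_encoder.distribution.log_likelihood_sym(m_prime, expert_dist_info)

info_loss = -tf.reduce_mean(log_q)
```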

However, in the code above, you simply compute the log probability of the sampled "m" under its own distribution q_\psi( \cdot | expert trajectory), i.e. the distribution estimated from the expert trajectory. In other words, steps 3 and 4 are skipped.
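
If my reading is right, the two objectives differ as follows:

- Paper: log q_\psi( m' | expert trajectory ), where m' comes from q_\psi( \cdot | agent trajectory ), so the m-conditioned policy enters the objective through tau_agent.
- Code: log q_\psi( m | expert trajectory ), where m ~ q_\psi( \cdot | expert trajectory ), so the policy does not enter this term at all.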

I want to know if this is intended and if my understanding is correct.

Thank you!

pengzhenghao — Feb 26 '23