IQ-Learn
Pseudocode and questions
Hey, thanks for sharing this work! I really appreciate the in-depth, beginner-friendly blog post! I was wondering whether this pseudocode is:
- Correct
- Helpful to anyone else trying to understand the code
If not, feel free to close. But I would appreciate it if you could help me understand a few parts of the code. Thanks!
Questions
- How come the environment reward env_reward is unused and the reward depends entirely on the model's output? Does this algorithm only imitate the expert and never take the environment reward into account? (My current reading is sketched right after this list.)
- Why is value_loss determined entirely from the model output? Wouldn't this cause the model to collapse?
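For context on the first question, my current reading is that IQ-Learn never trains on env_reward at all; instead the reward is recovered implicitly from the learned Q-function, roughly r(s, a) = Q(s, a) - gamma * V(s'). A tiny sketch of what I mean (the function name and the gamma default are mine, not from the repo):

import torch

def recover_implicit_reward(q_net, state, action, next_state, gamma=0.99):
    # Q(s, a) for the action that was actually taken
    q = q_net(state).gather(1, action.long().view(-1, 1))
    # soft value of the next state: V(s') = logsumexp_a Q(s', a)
    next_v = torch.logsumexp(q_net(next_state), dim=1, keepdim=True)
    # the implicit reward that training pushes up on expert transitions
    return q - gamma * next_v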
Pseudocode
from copy import deepcopy

import torch
from torch.distributions import Categorical
from torch.nn.functional import softmax

def init_network():
    # a single linear layer mapping state -> one Q-value per action
    q_net = torch.nn.Linear(state_size, action_size)
    target_net = deepcopy(q_net)
def episode_step():
    # sample an action from the softmax policy over the Q-values
    action = Categorical(probs=softmax(q_net(state), dim=-1)).sample()
    next_state, env_reward, done, _ = env.step(action)
    # env_reward is stored but (as far as I can tell) never used for training
    memory.append((state, next_state, action, env_reward, done))  # memory = collections.deque
    update_critic(memory, expert_memory)
    target_net = deepcopy(q_net)
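Side note on the action line above: as far as I can tell the policy is just the Boltzmann/softmax policy implied by the Q-values, which I wrote with a temperature of 1. With an explicit temperature alpha (which I assume maps onto an entropy temperature in the actual implementation) it would look like:

import torch
from torch.distributions import Categorical

def sample_action(q_net, state, alpha=1.0):
    # Boltzmann / soft policy implied by the Q-values: pi(a|s) proportional to exp(Q(s, a) / alpha)
    logits = q_net(state) / alpha
    return Categorical(logits=logits).sample()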
def update_critic(memory, expert_memory):
    # The idea here is that we backprop the (implicit) rewards for both the expert's
    # and the agent's actions: the batch holds agent samples first, then expert samples
    # (memory[:][i] is shorthand for "stack field i of every stored transition")
    state = torch.cat((memory[:][0], expert_memory[:][0]))
    next_state = torch.cat((memory[:][1], expert_memory[:][1]))
    action = torch.cat((memory[:][2], expert_memory[:][2]))
    done = torch.cat((memory[:][4], expert_memory[:][4]))
    # where_expert marks which rows of the concatenated batch came from expert_memory
    where_expert = torch.cat((torch.zeros(len(memory), dtype=torch.bool),
                              torch.ones(len(expert_memory), dtype=torch.bool)))
    # v = soft value of the current state: V(s) = logsumexp_a Q(s, a)
    v = torch.logsumexp(q_net(state), dim=1, keepdim=True)
    # next_v = soft value of the next state s'
    next_v = torch.logsumexp(q_net(next_state), dim=1, keepdim=True)
    # q = Q(s, a) for the action that was actually taken (action: index tensor of shape (batch, 1))
    q = q_net(state).gather(1, action)
    loss = iq_loss(q, v, next_v, done, where_expert)
    critic_optimizer.zero_grad()
    loss.backward()
    critic_optimizer.step()
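One thing I noticed while writing this up: my pseudocode creates target_net but then computes next_v from q_net directly. I assume the real implementation uses the target network (and maybe no_grad) for the next-state value, roughly like below, but please correct me if that's wrong:

import torch

def soft_value(net, state):
    # soft state value: V(s) = logsumexp_a Q(s, a)
    return torch.logsumexp(net(state), dim=1, keepdim=True)

v = soft_value(q_net, state)                       # gradients flow into the online network
with torch.no_grad():
    next_v = soft_value(target_net, next_state)    # frozen target network for V(s')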
def iq_loss(q, v, next_v, done, where_expert):
    # written as if `done` were a single flag; in a real batch it would be a
    # per-sample mask (see the batched sketch below)
    if done:
        expert_reward = q[where_expert]
        # Why is value_loss determined entirely from the model output? Wouldn't this cause the model to collapse?
        value_loss = v.mean()
    else:
        # implicit reward on expert samples: r(s, a) = Q(s, a) - V(s')
        expert_reward = (q - next_v)[where_expert]
        value_loss = (v - next_v).mean()
    # Why is this negative?
    expert_reward_loss = -expert_reward.mean()
    loss = expert_reward_loss + value_loss
    return loss
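For completeness, here is how I would write the same loss in batched form, with done as a per-sample tensor and an explicit discount gamma (the discount is my addition; my single-sample if/else above drops it):

import torch

def iq_loss_batched(q, v, next_v, done, where_expert, gamma=0.99):
    # q, v, next_v: shape (batch, 1); done, where_expert: shape (batch,)
    done = done.float().view(-1, 1)
    # y = gamma * V(s') for non-terminal transitions, 0 where the episode ended
    y = (1.0 - done) * gamma * next_v
    # implicit reward r(s, a) = Q(s, a) - gamma * V(s'), taken on the expert rows only
    expert_reward = (q - y)[where_expert]
    expert_reward_loss = -expert_reward.mean()
    # value term over the whole batch, as in my pseudocode above
    value_loss = (v - y).mean()
    return expert_reward_loss + value_loss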