Zhihan Yang

9 comments by Zhihan Yang

@araffin A single training run of an RL agent generates several monitor files (in my case, 3). Why is that? In other words, what does `stable_baselines3.common.monitor.load_results` do with them? I...
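For reference, a minimal sketch of the setup I have in mind, assuming a single `Monitor`-wrapped environment and a hypothetical `./logs/` directory (with a vectorized env, I believe each sub-environment writes its own `*.monitor.csv`, which may be where the multiple files come from):

```{python}
import os

import gym
from stable_baselines3 import PPO
from stable_baselines3.common.monitor import Monitor, load_results

log_dir = "./logs/"  # hypothetical directory for this sketch
os.makedirs(log_dir, exist_ok=True)

# The Monitor wrapper writes per-episode stats to <prefix>.monitor.csv.
env = Monitor(gym.make("CartPole-v1"), filename=os.path.join(log_dir, "run0"))

model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=10_000)

# load_results scans the directory, reads every *.monitor.csv it finds,
# and concatenates them into a single pandas DataFrame sorted by time.
df = load_results(log_dir)
print(df[["r", "l", "t"]].head())  # episode reward, length, wall-clock time
```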

There are several levels to my answer:

- Strangely, I had to change the method names from `_action` and `_reverse_action` to `action` and `reverse_action` for the code to work (see the sketch below).
- ...
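For concreteness, this is roughly what the wrapper looks like after the rename; the class name and the rescaling logic here are my own sketch, not the repo's code. Newer gym versions dispatch to `action` / `reverse_action` directly, while the old underscore-prefixed hooks are no longer called.

```{python}
import gym


class NormalizedActions(gym.ActionWrapper):
    """Rescale agent actions from [-1, 1] to the env's [low, high] range."""

    def action(self, action):
        # Called by newer gym in place of the old `_action` hook.
        low, high = self.action_space.low, self.action_space.high
        return low + (action + 1.0) * 0.5 * (high - low)

    def reverse_action(self, action):
        # Inverse mapping, replacing the old `_reverse_action` hook.
        low, high = self.action_space.low, self.action_space.high
        return 2.0 * (action - low) / (high - low) - 1.0


# Env id depends on the installed gym version ("Pendulum-v0" on older gym).
env = NormalizedActions(gym.make("Pendulum-v1"))
```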

I think this is weird, too.

```{python}
agent.memory.append(
    observation,
    agent.select_action(observation),
    0.,     # reward hard-coded to zero
    False,  # done hard-coded to False
)
```

Also, `done` is set to `False` in this tuple, which is more perplexing.

Having said that, I think this probably has a negligible effect on learning, given how large the replay buffer is, but I think it's good for...
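For comparison, the conventional pattern I would expect is for the reward and the done flag to come from `env.step` rather than being hard-coded. The loop below is my own sketch (`env` and `num_steps` are placeholders); only `agent.memory.append` follows the signature quoted above.

```{python}
observation = env.reset()
for step in range(num_steps):
    action = agent.select_action(observation)
    next_observation, reward, done, info = env.step(action)

    # Store the reward and termination signal actually returned by the env,
    # not hard-coded placeholders.
    agent.memory.append(observation, action, reward, done)

    observation = env.reset() if done else next_observation
```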

I think the answer is yes.

```{python}
policy_loss = -self.critic([
    to_tensor(state_batch),
    self.actor(to_tensor(state_batch)),
])
policy_loss = policy_loss.mean()
policy_loss.backward()
self.actor_optim.step()
```

First of all, I think it is clear that we are...
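To make the gradient flow concrete, here is a small standalone sketch (the toy networks and optimizer setup are mine, not the repo's): backpropagating the negated mean Q-value populates gradients in both networks, but only the actor's parameters move, because only `actor_optim.step()` is called.

```{python}
import torch
import torch.nn as nn

# Hypothetical stand-ins for the actor and critic in the snippet above.
actor = nn.Linear(4, 2)
critic = nn.Linear(4 + 2, 1)
actor_optim = torch.optim.Adam(actor.parameters(), lr=1e-3)

state_batch = torch.randn(8, 4)

q_values = critic(torch.cat([state_batch, actor(state_batch)], dim=-1))
policy_loss = -q_values.mean()
policy_loss.backward()

print(critic.weight.grad is not None)             # True: grads flow through the critic
critic_before = critic.weight.clone()
actor_optim.step()
print(torch.equal(critic_before, critic.weight))  # True: critic weights unchanged
```

Whether the leftover critic gradients matter then depends on the critic's own update zeroing them before its optimizer steps.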

@danijar That's actually an interesting perspective. But the mean can also be negative, right? If that's the case, the second term actually makes all actions more likely. So it's a...

Thanks a lot for the detailed response! I'm still in the process of understanding the derivation of the ELBO. Are there any helpful resources that I should consult?

I think this is a valid concern. Making state information available to the critic makes this implementation incorrect.

Did you reproduce this bad performance on multiple datasets? If so, which ones?