Tuomas Haarnoja
Hi, your reward scale might indeed be too large, which would explain the poor final performance. By instability, do you mean the spikes in the blue curve? It looks...
Hi, thanks for your question. The deterministic mode you are referring to is a heuristic, and does not correspond to any optimal policy, but can sometimes yield a higher return...
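As a minimal sketch of that heuristic (names here are illustrative, not the repo's actual API): at evaluation time the agent acts with the mean of the tanh-squashed Gaussian policy instead of sampling from it.

```python
import math
import random

def select_action(mean, log_std, deterministic=False, rng=random):
    """Return a scalar action from a tanh-squashed Gaussian policy.

    When `deterministic` is True, skip sampling and use the distribution's
    mean -- a heuristic that often, but not always, yields a higher return.
    """
    if deterministic:
        pre_tanh = mean                         # heuristic: take the mean
    else:
        pre_tanh = rng.gauss(mean, math.exp(log_std))  # sample as in training
    return math.tanh(pre_tanh)                  # squash into the action bounds
```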
I'm not quite sure what you mean. SQL works only with expressive policies, such as those based on SVGD. If you use a more restrictive policy class, such as Gaussian policies, then the algorithm actually corresponds...
Can you point me to the code?
Rendering videos is not currently supported, but it should be quite easy to add. Take a look at `sac` repository, which has the support: https://github.com/haarnoja/sac/blob/6b37e0165f5af549f2a6e463cc9b191ff8d62268/sac/misc/sampler.py#L41-L44
Thanks for the question. You mean Equation 13 in [this](https://arxiv.org/pdf/1801.01290.pdf) paper? It is the total derivative of J(\phi) with respect to the policy parameters \phi. Note that both \pi_\phi and...
SQL learns maximum-entropy policies, which is why the optimal policy is stochastic. You can try, for example, annealing the temperature to zero, or shaping the reward function by making...
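A minimal sketch of the annealing idea (the schedule and names are my assumptions, not something the repo provides): drive the entropy temperature toward zero over training so the maximum-entropy policy becomes nearly deterministic.

```python
def annealed_temperature(step, total_steps, alpha_initial=1.0, alpha_final=1e-3):
    """Linearly anneal the entropy temperature alpha toward (near) zero.

    As alpha shrinks, the entropy bonus vanishes and the learned policy
    concentrates on the highest-value actions.
    """
    frac = min(step / total_steps, 1.0)   # clamp after training ends
    return alpha_initial + frac * (alpha_final - alpha_initial)
```

Any monotone schedule (exponential decay, cosine, etc.) would serve the same purpose; linear is just the simplest to illustrate.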
Apologies for the delayed response! We are working on an improved version of SQL with better support for running things over ROS. We also plan to update the MuJoCo support at that...
Hi Lior, the environment is reset explicitly by the sampler here: https://github.com/haarnoja/sac/blob/master/sac/misc/sampler.py#L133 I hope this answers your question! Cheers, Tuomas
Hi, we do bootstrap when the path length exceeds the maximum length, because reaching the time limit does not mean that we enter a terminal state. We don't bootstrap if...
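The distinction above can be sketched in a few lines (names like `compute_td_target` are illustrative, not the repo's code): bootstrap from the next state's value whenever the episode was cut off by the time limit, and skip bootstrapping only on a true terminal state.

```python
def compute_td_target(reward, v_next, terminal, timed_out, discount=0.99):
    """One-step TD target for a single transition.

    terminal:  True only if the MDP actually ended (e.g., the agent fell).
    timed_out: True if the episode was cut off by the maximum path length.
    """
    if terminal and not timed_out:
        # A genuine terminal state: no future return to account for.
        return reward
    # Otherwise bootstrap -- hitting the time limit does not mean the
    # episode could not have continued, so the future value still counts.
    return reward + discount * v_next
```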