Tuomas Haarnoja
Hi, your reward scale might indeed be too large, which would explain the poor final performance. By instability, do you mean the spikes in the blue curve? It looks...
Hi, thanks for your question. The deterministic mode you are referring to is a heuristic, and does not correspond to any optimal policy, but can sometimes yield a higher return...
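As a minimal sketch of that heuristic (names here are illustrative, not the repo's actual API): at evaluation time the agent acts with the mean of the tanh-squashed Gaussian policy instead of sampling from it.

```python
import math
import random

def select_action(mean, log_std, deterministic=False, rng=random):
    """Return a scalar action from a tanh-squashed Gaussian policy.

    When `deterministic` is True, skip sampling and use the distribution's
    mean -- a heuristic that often, but not always, yields a higher return.
    """
    if deterministic:
        pre_tanh = mean                         # heuristic: take the mean
    else:
        pre_tanh = rng.gauss(mean, math.exp(log_std))  # sample as in training
    return math.tanh(pre_tanh)                  # squash into the action bounds
```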
I'm not quite sure what you mean. SQL works only with expressive policies, such as those based on SVGD. If you use a more restrictive policy class, such as Gaussian policies, then the algorithm actually corresponds...
Can you point me to the code?
Rendering videos is not currently supported, but it should be quite easy to add. Take a look at `sac` repository, which has the support: https://github.com/haarnoja/sac/blob/6b37e0165f5af549f2a6e463cc9b191ff8d62268/sac/misc/sampler.py#L41-L44
Thanks for the question. You mean Equation 13 in [this](https://arxiv.org/pdf/1801.01290.pdf) paper? It is the total derivative of J(\phi) with respect to the policy parameters \phi. Note that both \pi_\phi and...
SQL learns maximum-entropy policies, which is why the optimal policy is stochastic. You can try, for example, annealing the temperature to zero, or shaping the reward function by making...
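A minimal sketch of the annealing idea (the schedule and names are my assumptions, not something the repo provides): drive the entropy temperature toward zero over training so the maximum-entropy policy becomes nearly deterministic.

```python
def annealed_temperature(step, total_steps, alpha_initial=1.0, alpha_final=1e-3):
    """Linearly anneal the entropy temperature alpha toward (near) zero.

    As alpha shrinks, the entropy bonus vanishes and the learned policy
    concentrates on the highest-value actions.
    """
    frac = min(step / total_steps, 1.0)   # clamp after training ends
    return alpha_initial + frac * (alpha_final - alpha_initial)
```

Any monotone schedule (exponential decay, cosine, etc.) would serve the same purpose; linear is just the simplest to illustrate.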
Apologies for the delayed response! We are working on an improved version of SQL with better support for running things over ROS. We also plan to update the MuJoCo support at that...
Hi Lior, the environment is reset explicitly by the sampler here: https://github.com/haarnoja/sac/blob/master/sac/misc/sampler.py#L133 I hope this answers your question! Cheers, Tuomas
Hi, we do bootstrap when the path length exceeds the maximum length, because reaching the time limit does not mean that we enter a terminal state. We don't bootstrap if...
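The distinction above can be sketched in a few lines (names like `compute_td_target` are illustrative, not the repo's code): bootstrap from the next state's value whenever the episode was cut off by the time limit, and skip bootstrapping only on a true terminal state.

```python
def compute_td_target(reward, v_next, terminal, timed_out, discount=0.99):
    """One-step TD target for a single transition.

    terminal:  True only if the MDP actually ended (e.g., the agent fell).
    timed_out: True if the episode was cut off by the maximum path length.
    """
    if terminal and not timed_out:
        # A genuine terminal state: no future return to account for.
        return reward
    # Otherwise bootstrap -- hitting the time limit does not mean the
    # episode could not have continued, so the future value still counts.
    return reward + discount * v_next
```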