Tuomas Haarnoja
Hi Termset, it's this one: [https://github.com/rll/rllab/tree/master/sandbox](https://github.com/rll/rllab/tree/master/sandbox). Note that this repo is no longer actively maintained. I recommend using the [softlearning](https://github.com/rail-berkeley/softlearning) repo instead, which includes the most up-to-date version of...
Good catch! We actually tried both versions and did not find much difference between them. We'll fix the code in the next release.
Hi, we indeed use the same data to update both of the Q-functions. I haven't tested splitting the data and using different sets for different Q's, but I'm guessing that...
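To make this concrete, here is a minimal sketch of what "same data for both Q-functions" means in a single update step: both critics see the same minibatch and regress to the same target. This is an illustrative PyTorch toy, not the actual softlearning/TensorFlow code, and all names (`q1`, `q2`, `update_critics`) are made up:

```python
import torch
import torch.nn as nn

# Two independent critics; both are trained on the SAME minibatch below.
q1 = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
q2 = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(list(q1.parameters()) + list(q2.parameters()), lr=3e-4)

def update_critics(batch_sa, q_target):
    # batch_sa: concatenated (state, action) features; q_target: Bellman targets.
    # Same batch, same target, two separate squared-error losses.
    loss = ((q1(batch_sa) - q_target) ** 2).mean() + \
           ((q2(batch_sa) - q_target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# One update on a random toy minibatch:
update_critics(torch.randn(32, 4), torch.randn(32, 1))
```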
Thanks for your question. We use uniform sampling because there is no direct way to evaluate the log-probabilities of actions under SVGD policies, which would be needed for the importance...
I see, that's indeed confusing. You are right that we could compute the log-probs if the sampling network were invertible. My feeling is that, in our case, the...
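For reference, the invertibility point comes from the change-of-variables formula: if a = f(z) with f invertible, then log p(a) = log p(z) - log|det(da/dz)|, which is exactly the quantity importance weighting would need. A toy sketch, assuming a simple elementwise invertible map (not the SVGD sampler from the paper; `w`, `b`, and `sample_with_log_prob` are made up):

```python
import math
import torch

# Toy invertible sampler: a = tanh(w * z + b), elementwise, so the Jacobian
# is diagonal and log|det| is a sum of elementwise log-derivatives.
w = torch.tensor(1.5)
b = torch.tensor(0.2)

def sample_with_log_prob(num_samples, dim):
    z = torch.randn(num_samples, dim)          # base noise, z ~ N(0, I)
    pre = w * z + b
    a = torch.tanh(pre)
    # log p(a) = log p(z) - log|det(da/dz)|  (change of variables)
    log_p_z = (-0.5 * z ** 2).sum(-1) - 0.5 * dim * math.log(2 * math.pi)
    log_det = (torch.log(w.abs()) + torch.log(1 - torch.tanh(pre) ** 2)).sum(-1)
    return a, log_p_z - log_det

actions, log_probs = sample_with_log_prob(16, 2)  # log-probs usable as IS weights
```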
Do you mean the expectation over states and actions in Eq. (11)? It is OK, since the corresponding gradient estimator is unbiased, though it can have high variance.
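A quick way to see the unbiasedness point: replacing the expectation with sampled minibatches gives a gradient estimate whose mean equals the true gradient, at the cost of per-sample noise. A made-up numerical check of that general fact (not Eq. (11) itself):

```python
import torch

# J(theta) = E_x[(theta - x)^2] with x ~ N(0, 1), so dJ/dtheta = 2 * theta.
theta = torch.tensor(1.0, requires_grad=True)
grads = []
for _ in range(10000):
    x = torch.randn(())                 # one sample of the expectation variable
    loss = (theta - x) ** 2             # single-sample objective
    g, = torch.autograd.grad(loss, theta)
    grads.append(g)
print(torch.stack(grads).mean())        # ~= 2.0, the true gradient; each g is noisy
```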