
Continuous control

Open · muupan opened this issue on May 08 '16 · 6 comments

muupan · May 08 '16 10:05

I'm working on an LSTM implementation (neon-based) for the continuous case; sadly, I have failed to get any response from the authors.

It is the variance and entropy that puzzle me. Any thoughts on how they are implemented code-wise? Currently I see no signs of convergence on the MuJoCo domain, and most likely there are errors in the learned variance of the Gaussian policy.

igrekun · May 14 '16 19:05

Thanks for the information. I haven't tried it yet, but the paper provides some details, quoted below. Did you find them insufficient?

µ is modeled by a linear layer and σ² by a SoftPlus operation, log(1 + exp(x)), as the activation computed as a function of the output of a linear layer.

we used a cost on the differential entropy of the normal distribution defined by the output of the actor network, −½(log(2πσ²) + 1); we used a constant multiplier of 10⁻⁴ for this cost across all of the tasks examined.
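In code, I read that as roughly the following (a minimal Chainer-style sketch; the function names and the placement of the 1e-4 multiplier are my own reading, not from this repo):

import numpy as np
import chainer.functions as F

def gaussian_head(mu_logits, var_logits):
    # mu comes straight from one linear layer; sigma^2 is the softplus,
    # log(1 + exp(x)), of another linear layer's output, so it stays positive
    mu = mu_logits
    sigma2 = F.softplus(var_logits)
    return mu, sigma2

def entropy_cost(sigma2, beta=1e-4):
    # Differential entropy of the Gaussian is 0.5 * (log(2*pi*sigma^2) + 1)
    # per dimension; the paper's cost is the negative entropy times 1e-4
    entropy = F.sum(0.5 * (F.log(2 * np.pi * sigma2) + 1))
    return -beta * entropy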

muupan · May 15 '16 03:05

It is a bit vague for me, so I will try to summarize in order to be corrected: we need a fully connected layer outputting two values, add a softplus operation on the second value (so that the variance is > 0, I suppose), sample from this Gaussian (use numpy.randn() * sigma + mu?) in each dimension of the action space, and finally send −½(log(2πσ²) + 1) as the log-prob instead of log(softmax)?
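For concreteness, a rough NumPy-only sketch of that reading, not tested (mu_out and var_out stand in for the two outputs of that fully connected layer, one per action dimension):

import numpy as np

# placeholder outputs of the fully connected layer, one value per action dimension
mu_out = np.zeros(3, dtype=np.float32)
var_out = np.zeros(3, dtype=np.float32)

sigma2 = np.log1p(np.exp(var_out))                            # softplus: variance > 0
action = mu_out + np.sqrt(sigma2) * np.random.randn(*mu_out.shape)

# log-probability (density) of the sampled action; this is what replaces log(softmax)
logprob = -0.5 * (np.log(2 * np.pi * sigma2) + (action - mu_out) ** 2 / sigma2)

# Note: 0.5 * (log(2*pi*sigma2) + 1) is the entropy of the Gaussian; the paper's
# -1/2 (log(2*pi*sigma^2) + 1) term is the separate entropy cost, not the log-prob.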

etienne87 · Aug 03 '16 16:08

Hi @muupan, do you have a plan to implement continuous control? :)

loofahcus · Jan 04 '17 02:01

Here is an example:

import numpy as np

import chainer
import chainer.functions as F
from cached_property import cached_property


class GaussianPolicyOutput(PolicyOutput):
    """Diagonal Gaussian policy output, mirroring SoftmaxPolicyOutput."""

    def __init__(self, logits_mu, logits_var):
        self.logits_mu = logits_mu
        self.logits_var = logits_var

    @cached_property
    def activation(self):
        mu = F.tanh(self.logits_mu)           # mean bounded to [-1, 1]
        sigma2 = F.softplus(self.logits_var)  # softplus keeps the variance positive
        return mu, sigma2

    @cached_property
    def action_indices(self):
        # Same name as in SoftmaxPolicyOutput so that a3c.py can call it without
        # changes; here it samples from the Gaussian instead of picking an index
        mu, sigma2 = self.activation
        noise = np.random.standard_normal(mu.data.shape)
        return (mu.data + np.sqrt(sigma2.data) * noise).astype(np.float32)

    @cached_property
    def sampled_actions_log_probs(self):
        # Chainer Variable holding the log-probability of the sampled action
        mu, sigma2 = self.activation
        action = self.action_indices
        # gaussian_nll expects the log-variance and returns the negative
        # log-likelihood, so negate it to get the log-probability
        return -F.gaussian_nll(chainer.Variable(action), mu, F.log(sigma2))

    @cached_property
    def entropy(self):
        # Differential entropy of the Gaussian, 0.5 * (log(2*pi*sigma^2) + 1) per
        # dimension, computed on the Variable (not .data) so that gradients can
        # flow back into the variance head
        _, sigma2 = self.activation
        return F.sum(0.5 * (F.log(2 * np.pi * sigma2) + 1))

I haven't tested it yet, so feel free to test/correct it.
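For reference, a rough sketch of how this might plug into the actor update, assuming a3c.py combines the pieces the same way as in the discrete case (the advantage estimate and the 1e-4 entropy coefficient come from the paper, not from this repo):

# pout is a GaussianPolicyOutput; advantage is R - V(s), treated as a constant;
# beta = 1e-4 follows the paper's entropy-cost multiplier
def actor_loss(pout, advantage, beta=1e-4):
    log_prob = pout.sampled_actions_log_probs
    # maximize log_prob * advantage + beta * entropy, i.e. minimize the negative
    return -(log_prob * float(advantage) + beta * pout.entropy)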

etienne87 · Feb 06 '17 18:02

Thanks! @etienne87

loofahcus · Feb 07 '17 04:02