
KLqp does not work with multiple variables

Open dtolpin opened this issue 6 years ago • 5 comments

import numpy
import tensorflow as tf
import edward as ed
from edward.models import Exponential, Gamma

bw = Exponential(0.1)
y = Exponential([1. / bw, 1. / bw])
ys = tf.stack([Exponential(1. / bw), Exponential(1. / bw)])

Both y and ys have the same distribution. Metropolis-Hastings works for both:

# sampled_bandwidth and proposed_bandwidth (definitions omitted here) are
# the sample and proposal random variables for bw
y_inference = ed.MetropolisHastings({bw: sampled_bandwidth},
                                    {bw: proposed_bandwidth},
                                    data={y: numpy.array([100., 100.])})
ys_inference = ed.MetropolisHastings({bw: sampled_bandwidth},
                                     {bw: proposed_bandwidth},
                                     data={ys: numpy.array([100., 100.])})

KLqp works for y but not for ys. For ys it just learns the prior.

shape = tf.get_variable("shape", (), initializer=tf.constant_initializer(2.))
scale = tf.get_variable("scale", (), initializer=tf.constant_initializer(10.))
qbandwidth = Gamma(tf.nn.softplus(shape), 1. / tf.nn.softplus(scale))

y_inference = ed.KLqp({bw: qbandwidth}, data={y: numpy.array([100., 100.])})
ys_inference = ed.KLqp({bw: qbandwidth}, data={ys: numpy.array([100., 100.])})
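
Fitting both and inspecting the fitted posterior mean makes the symptom visible (a sketch, not part of the original report; inference.run() re-initializes all variables, so the two runs don't interfere):

sess = ed.get_session()

y_inference.run(n_iter=1000)
print(sess.run(qbandwidth.mean()))   # moves away from the prior, toward the data

ys_inference.run(n_iter=1000)
print(sess.run(qbandwidth.mean()))   # stays at the prior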

dtolpin avatar Aug 01 '18 14:08 dtolpin

The result of ys = tf.stack([Exponential(1. / bw), Exponential(1. / bw)]) is no longer a random variable but a tf.Tensor. KLqp only calculates the log probability of an item in the data dictionary if it is an ed.RandomVariable (since it needs to call its .log_prob method). In your case it won't take ys into account because it is a tf.Tensor, which is probably why it just learns the prior.
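
A quick standalone check (minimal sketch):

import tensorflow as tf
from edward.models import Exponential, RandomVariable

bw = Exponential(0.1)
ys = tf.stack([Exponential(1. / bw), Exponential(1. / bw)])

print(isinstance(ys, RandomVariable))  # False: tf.stack returns a plain tf.Tensor
print(hasattr(ys, 'log_prob'))         # False: nothing for KLqp to evaluate

A possible workaround (also a sketch) is to batch the rate so the observed variable stays a single RandomVariable, which is exactly what your y does:

ys_rv = Exponential(rate=tf.stack([1. / bw, 1. / bw]))
print(isinstance(ys_rv, RandomVariable))  # True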

bkayalibay avatar Aug 01 '18 17:08 bkayalibay

This explains why it happens, but it is still a bug (or an ill-designed feature). Metropolis-Hastings works on this model. ReparameterizationKLqp works on this model. Yet KLqp, a general-purpose inference algorithm, silently gives wrong results when applied to it.

I distilled the simplified example above from a complicated model (quoted below just for completeness). There is no easy way to figure out that the inference result is wrong simply because the algorithm does not work on the model. Leaving behavior like this in the system makes Edward quite useless for anything elaborate.

def update_beliefs(beliefs, i, j, bandwidth):
    
    # updates the beliefs with new evidence
    update = tf.scatter_nd(tf.stack([tf.stack([i, j])]),
                           tf.constant([1.]),
                           beliefs.shape)
    beliefs = beliefs + update
    
    # compute new evidence in the updated row
    evidence = tf.reduce_sum(beliefs[i, :])
    
    # if the evidence is greater than the bandwidth,
    # scale down
    scale = bandwidth / evidence
    beliefs = tf.cond(scale < 1.,
                      lambda: beliefs * 
                              tf.exp(tf.scatter_nd(tf.stack([tf.stack([i, tf.constant(0)]),
                                                             tf.stack([i, tf.constant(1)])]),
                                                   tf.log(tf.stack([scale, scale])),
                                                   beliefs.shape)),
                      lambda: beliefs)
    return beliefs

def model(bandwidth, page_count, number_of_sessions, data):
    churn_probability = 1 / PAGES_PER_SESSION_PRIOR
    beliefs = tf.stack([2 * churn_probability * tf.ones(page_count),
                        2 * (1 - churn_probability) * tf.ones(page_count)],
                       axis=1)

    def over_sessions(state, isession):
       
        def sample_over_pages(beliefs, ipage, last_page):
            alpha = beliefs[ipage, 0]
            beta = beliefs[ipage, 1]
            # Bernoulli's first positional argument is logits; pass the
            # churn probability explicitly as probs
            last_page = tf.logical_or(tf.equal(Bernoulli(probs=alpha / (alpha + beta)), 1),
                                      tf.equal(ipage, page_count - 1))
            return (beliefs, ipage + 1, last_page)

        def observe_over_pages(beliefs, ipage, last_page):
            last_page = tf.logical_or(tf.equal(ipage, data[isession] - 1),
                                      tf.equal(ipage, page_count - 1))
            beliefs = update_beliefs(beliefs, ipage, 
                                     tf.cond(last_page, lambda: 0, lambda: 1),
                                     bandwidth) 
            return (beliefs, ipage + 1, last_page)

        def continues(beliefs, ipage, last_page):
            return tf.logical_not(last_page)

        beliefs, _ = state
        beliefs, _, _ = tf.while_loop(continues, observe_over_pages, (beliefs, 0, tf.constant(False)))
        _, pps, _ = tf.while_loop(continues, sample_over_pages, (beliefs, 0, tf.constant(False)))
        
        return beliefs, pps
    
    beliefs, sessions = tf.scan(over_sessions, tf.range(number_of_sessions), (beliefs, 0))

    return sessions

dtolpin avatar Aug 01 '18 17:08 dtolpin

So calling ys_inference = ed.ReparameterizationKLqp({bw: qbandwidth}, {ys: numpy.array([100., 100.])}) works? I'm surprised by that, since it should have the same problem as plain ed.KLqp due to this if-statement. If that works, I don't see why ed.KLqp doesn't. It might be that my initial guess was wrong and the issue is somewhere else.
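
A self-contained side-by-side test would settle it (a sketch; if the if-statement theory is right, both algorithms should stay at the prior for ys):

import edward as ed
import numpy as np
import tensorflow as tf
from edward.models import Exponential, Gamma

bw = Exponential(0.1)
ys = tf.stack([Exponential(1. / bw), Exponential(1. / bw)])

shape = tf.get_variable("shape", (), initializer=tf.constant_initializer(2.))
scale = tf.get_variable("scale", (), initializer=tf.constant_initializer(10.))
qbw = Gamma(tf.nn.softplus(shape), 1. / tf.nn.softplus(scale))

for Algo in (ed.KLqp, ed.ReparameterizationKLqp):
    inference = Algo({bw: qbw}, data={ys: np.array([100., 100.], dtype=np.float32)})
    inference.run(n_iter=1000)
    print(Algo.__name__, ed.get_session().run(qbw.mean()))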

bkayalibay avatar Aug 01 '18 17:08 bkayalibay

Just wanted to say quickly that I ran into a problem similar to @dtolpin's in a pretty simple hierarchical linear model. Calling ed.KLqp learned an essentially degenerate distribution for the Gaussian mean-field parameters: every mean was 0, and the variance of each term simply exploded (which seemed to align roughly with the values I set for the priors). Solely changing ed.KLqp to ed.ReparameterizationKLqp (terrible name, btw) actually got my parameters to converge nicely. It took me a long time to figure this out, and I was just lucky enough to stumble upon this thread early on. If it helps, my model was something like this:

$\hat{irr} = \beta_1 Crops + \beta_2 spei + \beta_b + \epsilon_1$, with $\epsilon_1 \sim N(\beta^\top X_1, 1)$

$\hat{well} = well_{lag} + \gamma_1 \hat{irr} + \gamma_2 spei + \epsilon_2$, with $\epsilon_2 \sim N(well_{lag} + \gamma^\top X_2, 1)$

Pardon my lack of GitHub LaTeX knowledge. If it's helpful I can post my code too.
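
In the meantime, here is a rough structural sketch (not the actual code: names, sizes, and data are all made up):

import edward as ed
import numpy as np
import tensorflow as tf
from edward.models import Normal

N = 50  # assumed number of observations

# fake inputs, just to make the sketch self-contained
crops = tf.constant(np.random.randn(N).astype(np.float32))
spei = tf.constant(np.random.randn(N).astype(np.float32))
well_lag = tf.constant(np.random.randn(N).astype(np.float32))

beta = Normal(loc=tf.zeros(3), scale=tf.ones(3))    # beta_1, beta_2, beta_b
gamma = Normal(loc=tf.zeros(2), scale=tf.ones(2))   # gamma_1, gamma_2

X1 = tf.stack([crops, spei, tf.ones(N)], axis=1)
irr = Normal(loc=ed.dot(X1, beta), scale=tf.ones(N))

X2 = tf.stack([irr, spei], axis=1)                  # irr feeds the second stage
well = Normal(loc=well_lag + ed.dot(X2, gamma), scale=tf.ones(N))

# Gaussian mean-field guide; softplus keeps the scales positive
qbeta = Normal(loc=tf.get_variable("qbeta/loc", [3]),
               scale=tf.nn.softplus(tf.get_variable("qbeta/scale", [3])))
qgamma = Normal(loc=tf.get_variable("qgamma/loc", [2]),
                scale=tf.nn.softplus(tf.get_variable("qgamma/scale", [2])))

# ed.KLqp drove the q means to 0 and inflated the scales here;
# switching to ed.ReparameterizationKLqp converged
inference = ed.ReparameterizationKLqp(
    {beta: qbeta, gamma: qgamma},
    data={well: np.zeros(N, dtype=np.float32)})
inference.run(n_iter=1000)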

M-Harrington avatar Dec 17 '18 23:12 M-Harrington

Edit: Looking back, I now realize this is much more likely due to the variance-reducing properties of the reparameterization gradient than to a problem with the software.

M-Harrington avatar Dec 17 '18 23:12 M-Harrington