
Second-order score gradients

Bonnevie opened this issue 6 years ago · 7 comments

This is admittedly an esoteric issue. The score gradient is cleverly implemented in Edward with the step

q_grads = tf.gradients(
      -(tf.reduce_mean(q_log_prob * tf.stop_gradient(losses)) - reg_penalty),
      q_vars)

ensuring that the result is of the (pseudo-code) form mean(score*losses).

Now say that I want to define an operation which takes the score gradient as an input. If I try to take the gradient of this derived expression, the result will be wrong due to the stop_gradient op. Is there a clever idiomatic way to define the score gradient without compromising its derivative?
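To make the failure concrete, here is a minimal toy sketch (all names are stand-ins of mine, not Edward code):

    import tensorflow as tf

    theta = tf.Variable(0.5)
    q_log_prob = tf.log(theta)   # stand-in for the variational log-density
    losses = tf.square(theta)    # stand-in for the per-sample losses

    surrogate = -tf.reduce_mean(q_log_prob * tf.stop_gradient(losses))
    q_grads = tf.gradients(surrogate, [theta])  # score-gradient estimator: fine
    # Differentiating the estimator again silently treats `losses` as a
    # constant, so the d(losses)/d(theta) contribution is missing:
    second = tf.gradients(q_grads, [theta])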

Note that the score could be computed before taking the product with losses, but since TensorFlow's tf.gradients only differentiates scalar quantities (it sums over vector outputs), this would involve unstacking and looping over q_log_prob, along the lines sketched below.
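Something like this, reusing q_log_prob, losses, and q_vars from the snippet above (a sketch; assumes q_log_prob has a static [n_samples] shape):

    # One tf.gradients call per sample: expensive, but no stop_gradient is
    # needed, so the averaged product stays differentiable through `losses`.
    scores = [tf.gradients(lp, q_vars) for lp in tf.unstack(q_log_prob)]
    loss_terms = tf.unstack(losses)
    q_grads = [
        -tf.add_n([s[i] * l for s, l in zip(scores, loss_terms)]) / len(loss_terms)
        for i in range(len(q_vars))
    ]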

Thinking it over, maybe the simplest way to attack the problem is to use graph modification to swap the tf.stop_gradient(losses) node for the plain losses tensor?
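Something like the following, maybe (a sketch with TF 1.x's tf.contrib.graph_editor; I haven't verified that swap_ts is the right call or direction, so treat the last line as a guess):

    from tensorflow.contrib import graph_editor as ge

    stopped = tf.stop_gradient(losses)
    surrogate = -(tf.reduce_mean(q_log_prob * stopped) - reg_penalty)
    q_grads = tf.gradients(surrogate, q_vars)  # built while the stop is in place

    # Rewire consumers of the stopped tensor to the raw `losses`, so that
    # later tf.gradients calls on q_grads see the dependence on `losses`;
    # cannot_modify guards the StopGradient op itself against rewiring.
    ge.swap_ts([stopped], [losses], cannot_modify=[stopped.op])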

edit: for those who don't find this an interesting intellectual pursuit in and of itself, note that this becomes quite relevant if one wants to compute the gradient of the estimator's variance, as in the REBAR and RELAX estimators used for discrete variational approximations.

Bonnevie · Mar 25 '18

Is there a clever idiomatic way to define the score gradient without compromising its derivative?

Yes, there is! I was just chatting with Jakob Foerster last week about getting DiCE (https://arxiv.org/abs/1802.05098) into Edward. I don't know his GitHub handle; cc'ing @alshedivat and @rockt, who also worked on it. Contributions are welcome.

dustinvtran · Mar 25 '18

@dustinvtran Ah, I did see DiCE when it came out, and I looked it over again this Friday hoping it would solve my problem in an instant, but I think this might be a different problem? With DiCE the goal is to build unbiased estimators of higher-order derivatives, while here the goal is to take the derivative of an existing first-order estimator. I can see how my title might be a tad misleading in that respect.

Bonnevie · Mar 25 '18

Right, it depends on what you're taking derivatives of: exact first-order gradients (which DiCE solves) or the first-order gradient estimator.

For the latter, have you seen Edward2's klqp implementation? It avoids tf.stop_gradient altogether by building a "scale factor", which is local to the stochastic node and not global like tf.stop_gradient.

https://github.com/blei-lab/edward/blob/feature/2.0/edward/inferences/klqp.py#L36
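Roughly, the idea is something like this (my sketch, not the actual klqp code; the Normal node and the squared loss are purely illustrative):

    import tensorflow as tf
    tfd = tf.contrib.distributions

    loc = tf.Variable(0.0)
    qz = tfd.Normal(loc=loc, scale=1.0)
    z = qz.sample(50)
    z_value = tf.stop_gradient(z)      # the stop is local to this node's sample
    q_log_prob = qz.log_prob(z_value)  # still differentiable w.r.t. `loc`

    losses = tf.square(z_value)        # stand-in for the model's per-sample losses
    surrogate = tf.reduce_mean(q_log_prob * losses)
    # tf.gradients(surrogate, [loc]) yields the score-gradient estimator with
    # no global stop_gradient wrapped around `losses`.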

dustinvtran · Mar 25 '18

That's a slightly dense implementation; I might need a few pointers. Is the idea to have the surrogate_loss do an implicit stop_gradient by swapping in the probability calculated using x.value instead of x? Doesn't that lock the node to x.value at definition time?

edit: For the record, using graph_editor seems to work, although it's less elegant.

Bonnevie · Mar 25 '18

Yes, so DiCE lets you define an objective such that the gradient of the objective is an estimator of the true gradient. This holds for arbitrary orders of derivatives, so you don't have to worry about how to differentiate the estimator.

I think I understand your use case though and I agree that it's not obvious that DiCE would solve this out of the box.
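For concreteness, the core DiCE operator is tiny; a sketch in TF, reusing q_log_prob and losses from earlier in the thread:

    def magic_box(log_prob):
        # Forward value is exactly exp(0) = 1, but the gradient w.r.t. the
        # distribution parameters is the score, and this composes correctly
        # for higher-order derivatives.
        return tf.exp(log_prob - tf.stop_gradient(log_prob))

    # Gradients of any order of this objective are estimators of the
    # corresponding gradients of the expected loss.
    surrogate = tf.reduce_mean(magic_box(q_log_prob) * losses)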

jakobnicolaus · Mar 25 '18

This Pyro DiCE implementation & use case might be informative: https://github.com/uber/pyro/blob/684c909c7f66ced5408d4ea01dff9259d8b19bd2/pyro/infer/util.py#L109 https://github.com/uber/pyro/blob/ec8714f36de26d11c4d87155f68ba1e3d1868f2d/pyro/infer/traceenum_elbo.py#L17

ethancaballero · Apr 12 '18

Thanks for the pointer! From this:

        for model_trace, guide_trace in self._get_traces(model, guide, *args, **kwargs):
            elbo_particle = _compute_dice_elbo(model_trace, guide_trace)
            if is_identically_zero(elbo_particle):
                continue

            elbo += elbo_particle.item() / self.num_particles

it would appear that they don't have a clever way of vectorizing over samples.
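A vectorized variant seems possible in principle, something like (a sketch with hypothetical placeholder shapes, not Pyro code):

    import tensorflow as tf

    # Hypothetical per-particle quantities, each of shape [num_particles].
    log_probs = tf.placeholder(tf.float32, shape=[None])
    elbo_particles = tf.placeholder(tf.float32, shape=[None])

    # Magic-box weights for all particles at once; no Python-level loop.
    weights = tf.exp(log_probs - tf.stop_gradient(log_probs))
    elbo = tf.reduce_mean(weights * elbo_particles)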

Bonnevie · Apr 13 '18