Second-order score gradients
This is admittedly an esoteric issue. The score gradient is cleverly implemented in Edward with the step

q_grads = tf.gradients(
    -(tf.reduce_mean(q_log_prob * tf.stop_gradient(losses)) - reg_penalty),
    q_vars)

ensuring that the result is of the (pseudo-code) form mean(score * losses).
Now say that I want to define an operation which takes the score gradient as an input. If I try to take the gradient of this derived expression, the result will be wrong due to the stop_gradient op. Is there a clever idiomatic way to define the score gradient without compromising its derivative?
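To make the failure mode concrete, here is a toy version of what goes wrong (hypothetical scalar stand-ins for q_log_prob, losses and a single variational parameter, not Edward's actual code):

import tensorflow as tf

theta = tf.Variable(0.5)
losses = tf.square(theta)      # stand-in learning signal that depends on theta
q_log_prob = tf.sin(theta)     # stand-in for log q(z; theta)

surrogate = q_log_prob * tf.stop_gradient(losses)
score_grad = tf.gradients(surrogate, theta)[0]  # the intended first-order estimator

# Differentiating the estimator once more: the terms that flow through
# `losses` are silently dropped, because stop_gradient severed that path
# when the surrogate was defined.
second = tf.gradients(score_grad, theta)[0]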
Note that the score could be computed prior to taking the product with losses, but since TensorFlow can only compute derivatives of scalar quantities, this would involve unstacking and looping over q_log_prob, as sketched below.
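Roughly like this (ignoring the sign and the reg_penalty from the snippet above, and assuming a statically known n_samples, with q_log_prob and losses as length-n_samples vectors and q_vars the list of variational parameters):

import tensorflow as tf

# One tf.gradients call per sample, since tf.gradients only differentiates
# (sums of) scalars.
per_sample_scores = [tf.gradients(q_log_prob[i], q_vars)
                     for i in range(n_samples)]

# Take the product with the losses *after* differentiating, so no
# stop_gradient is needed and the result stays differentiable.
q_grads = [tf.add_n([losses[i] * scores[k]
                     for i, scores in enumerate(per_sample_scores)]) / n_samples
           for k in range(len(q_vars))]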
Thinking it over, maybe the simplest way to attack the problem is to use graph modification to swap the tf.stop_gradient(losses) node for a plain losses tensor?
edit: for those who don't find this an interesting intellectual pursuit in and of itself, I'll note that it becomes quite relevant if one wants to calculate the variance gradient for the REBAR and RELAX estimators used in discrete variational approximations.
Is there a clever idiomatic way to define the score gradient without compromising its derivative?
Yes, there is! I was just chatting with Jakob Foerster last week about getting DiCE (https://arxiv.org/abs/1802.05098) into Edward. I don't know his GitHub handle—cc'ing @alshedivat and @rockt, who also worked on it. Contributions are welcome.
@dustinvtran Ah, I did see DiCE when it came out, and I looked it over again this Friday hoping it would solve my problem in an instant, but I think this might be a different problem? With DiCE the goal is to build unbiased estimators of higher-order derivatives, while here the goal is to take the derivative of an existing first-order estimator. I can see how my title might be a tad misleading in that respect.
Right, it depends on what you're taking derivatives of—exact first-order gradients (which DiCE solves) or the first-order gradient estimator.
For the latter, have you seen Edward2's klqp implementation? It avoids tf.stop_gradient altogether by building a "scale factor", which is local to the stochastic node and not global like tf.stop_gradient.

https://github.com/blei-lab/edward/blob/feature/2.0/edward/inferences/klqp.py#L36
That's a slightly dense implementation; I might need a few pointers. Is the idea to have the surrogate_loss do an implicit stop_gradient by swapping in the probability calculated using x.value instead of x? Doesn't that lock the node to x.value at definition time?
edit: For the record, using graph_editor seems to work, although it's less elegant; see the sketch below.
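The surgery I have in mind is roughly the following (stopped_losses is a hypothetical handle to the tf.stop_gradient(losses) tensor inside the objective, and I may be misremembering the graph_editor argument order; there is also ge.swap_ts):

import tensorflow as tf
from tensorflow.contrib import graph_editor as ge

# Redirect the consumers of the stop_gradient output to the raw `losses`
# tensor, so that later tf.gradients calls see the full dependency again.
ge.reroute_ts([losses], [stopped_losses])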
Yes, so DiCE lets you define an objective such that the gradient of the objective is an estimator of the gradient. This holds for arbitrary orders of derivatives, so you don't have to worry about how to differentiate the estimator.
I think I understand your use case though and I agree that it's not obvious that DiCE would solve this out of the box.
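That said, in TF terms the core trick is the MagicBox operator. A rough, untested sketch using the names from the snippet at the top of the thread (single stochastic node per sample), just to show how it would slot in:

import tensorflow as tf

def magic_box(x):
    # DiCE's MagicBox: equals 1 in the forward pass, while its gradient
    # injects d(x) via exp(x - stop_gradient(x)), so expressions built with
    # it stay differentiable to any order.
    return tf.exp(x - tf.stop_gradient(x))

# Per-sample surrogate: losses weighted by the magic box of the log-prob.
dice_objective = tf.reduce_mean(magic_box(q_log_prob) * losses)

# Gradient of the surrogate; this matches the score-gradient estimator when
# losses carries no extra dependence on the variational parameters.
q_grads = tf.gradients(dice_objective, q_vars)

# Anything built from q_grads can now be differentiated in turn, e.g. a
# (hypothetical) variance-style penalty on the gradient estimate:
grad_penalty = tf.add_n([tf.reduce_sum(tf.square(g)) for g in q_grads])
penalty_grads = tf.gradients(grad_penalty, q_vars)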
This Pyro DiCE implementation & use case might be informative: https://github.com/uber/pyro/blob/684c909c7f66ced5408d4ea01dff9259d8b19bd2/pyro/infer/util.py#L109 https://github.com/uber/pyro/blob/ec8714f36de26d11c4d87155f68ba1e3d1868f2d/pyro/infer/traceenum_elbo.py#L17
Thanks for the pointer! From this:
for model_trace, guide_trace in self._get_traces(model, guide, *args, **kwargs):
    elbo_particle = _compute_dice_elbo(model_trace, guide_trace)
    if is_identically_zero(elbo_particle):
        continue
    elbo += elbo_particle.item() / self.num_particles
it would appear that they don't have a clever way of vectorizing over samples.