
Calculation of scale term

Open meowcakes opened this issue 5 years ago • 8 comments

Hi guys,

In Table 1 of the paper, you specify that the NN learns the log of the scale, so the scale would be calculated as `s = exp(log s)`. However, in your code the scale is calculated as `scale = tf.nn.sigmoid(h[:, :, :, 1::2] + 2.)`. Would it be possible to elaborate on why this calculation was used instead? I'm assuming it's for reasons of numerical stability?
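
For reference, the two versions I'm comparing look roughly like this (my own sketch, not code from the repo; `h` is the raw output of the coupling network):

```python
import tensorflow as tf

def scale_from_exp(h):
    # Table 1 of the paper: the network predicts log(s), so s = exp(log s).
    # Unbounded above.
    return tf.exp(h[:, :, :, 1::2])

def scale_from_sigmoid(h):
    # The code in this repo: bounded in (0, 1); the +2 shift puts the scale
    # near sigmoid(2) ~ 0.88 when the network output starts around zero.
    return tf.nn.sigmoid(h[:, :, :, 1::2] + 2.)
```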

meowcakes · Oct 17 '18

I'm also curious about this. What was the reasoning for switching from exp to sigmoid? Was it just to keep the result bounded?

catalys1 · Feb 26 '19

So I asked the authors at NeurIPS last year: the sigmoid is used to bound the gradients of the affine coupling layer. In the earlier Real-NVP work, a tanh is used for the same reason. I've tried training without this kind of bounding and it didn't converge.

wanglouis49 · Feb 28 '19

Ah, I see. I wonder if you could use something like `ReLU(x) + 1`. Then the gradient would always be nice and strong, and the constant would prevent divide-by-zero problems when inverting.
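
Something like this is what I have in mind (just a sketch):

```python
import tensorflow as tf

def scale_from_relu(h):
    # The ReLU gradient w.r.t. h is constant where it is nonzero, and the +1
    # keeps the scale strictly positive, so the inverse (y - shift) / scale
    # never divides by zero. The scale itself is still unbounded above.
    return tf.nn.relu(h[:, :, :, 1::2]) + 1.
```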

catalys1 · Mar 01 '19

I think the problem is not that the gradient isn't strong enough. Quite the opposite: you want to bound it.

wanglouis49 · Mar 01 '19

Sure, I understand that. But the gradient of the ReLU is bounded (it's piecewise constant) as well. And it's a simpler function, without the vanishing-gradient problem of the sigmoid. I don't know whether it would actually perform any better, though.

catalys1 · Mar 01 '19

Note that `y = scale * x + shift` with `scale = tf.nn.relu(h[:, :, :, 1::2])` and `shift = h[:, :, :, ::2]`. I agree that dy/dh is bounded here due to the ReLU, but dy/dx = scale, which corresponds to the entries of the Jacobian matrix, is still unbounded. In that case the determinant of the Jacobian can be huge, implying a dramatic volume growth from x to y. This makes training unstable in my experiments. The idea of using the sigmoid is to bound dy/dx from above; in fact, since the sigmoid output is below 1, it only allows the volume to shrink. I think it sacrifices capacity for stability. However, I don't have any further intuition beyond these observations. I guess it is something to overcome.
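
To make that concrete, here is a rough sketch of one coupling step with the sigmoid version and its log-determinant (illustrative shapes and names, not the exact repo code):

```python
import tensorflow as tf

def affine_coupling_forward(x, h):
    # h is the coupling network's output; x is the half being transformed
    shift = h[:, :, :, ::2]
    scale = tf.nn.sigmoid(h[:, :, :, 1::2] + 2.)  # in (0, 1), so dy/dx = scale < 1
    y = scale * x + shift
    # log|det J| is the sum of log(scale) over the transformed dimensions;
    # with scale < 1 every term is negative, i.e. the volume can only shrink.
    logdet = tf.reduce_sum(tf.math.log(scale), axis=[1, 2, 3])
    return y, logdet
```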

wanglouis49 · Mar 02 '19

@wanglouis49 Thanks for your insight.

catalys1 · Mar 05 '19

Has anyone tried simply splitting the output of the coupling network along the channel dimension and using the first half as the scale and the second half as the translation parameter, like in RealNVP/NICE? Is there any reason why Glow uses the odd-indexed channels for scale and the even-indexed channels for translation?
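
In other words, the two conventions would look roughly like this (my own sketch, with `h` as the coupling network's output):

```python
import tensorflow as tf

def split_interleaved(h):
    # Glow-style: even-indexed channels -> shift, odd-indexed channels -> scale
    shift = h[:, :, :, ::2]
    scale = h[:, :, :, 1::2]
    return shift, scale

def split_contiguous(h):
    # RealNVP/NICE-style contiguous split along the channel axis
    scale, shift = tf.split(h, num_or_size_splits=2, axis=-1)
    return shift, scale
```

My guess is that, since both halves come from the last layer of the same network, the two conventions only differ by a fixed permutation of that layer's output channels, so they should be equally expressive, but I'd be curious whether there is more to it.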

SrinjaySarkar · Mar 05 '21