
Calculation of scale term

Open meowcakes opened this issue 5 years ago • 8 comments

Hi guys,

In Table 1 of the paper, you specify that the NN learns the log of the scale, so the scale would be calculated as `s = exp(log s)`. However, in your code the scale is calculated as `scale = tf.nn.sigmoid(h[:, :, :, 1::2] + 2.)`. Would it be possible to elaborate on why this calculation was used instead? I'm assuming it's for reasons of numerical stability?
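
For reference, the two versions I'm comparing look roughly like this (my own sketch, not code from the repo; `h` is the raw output of the coupling network):

```python
import tensorflow as tf

def scale_from_exp(h):
    # Table 1 of the paper: the network predicts log(s), so s = exp(log s).
    # Unbounded above.
    return tf.exp(h[:, :, :, 1::2])

def scale_from_sigmoid(h):
    # The code in this repo: bounded in (0, 1); the +2 shift puts the scale
    # near sigmoid(2) ~ 0.88 when the network output starts around zero.
    return tf.nn.sigmoid(h[:, :, :, 1::2] + 2.)
```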

meowcakes · Oct 17 '18

I'm also curious about this. What was the reasoning for switching from exp to sigmoid? Was it just to keep the result bounded?

catalys1 · Feb 26 '19

So I asked the authors at NeurIPS last year: the sigmoid is used to bound the gradients of the affine coupling layer. In the earlier Real-NVP work, a tanh is used for the same reason. I've tried training without this kind of bounding and it didn't converge.

wanglouis49 · Feb 28 '19

Ah, I see. I wonder if you could use something like `ReLU(x) + 1`. Then the gradient would always be nice and strong, and the constant would prevent divide-by-zero problems when inverting.
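
Something like this is what I have in mind (just a sketch):

```python
import tensorflow as tf

def scale_from_relu(h):
    # The ReLU gradient w.r.t. h is constant where it is nonzero, and the +1
    # keeps the scale strictly positive, so the inverse (y - shift) / scale
    # never divides by zero. The scale itself is still unbounded above.
    return tf.nn.relu(h[:, :, :, 1::2]) + 1.
```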

catalys1 · Mar 01 '19

I think the problem is not that the gradient isn't strong enough. Quite the opposite: you want to bound it.

wanglouis49 · Mar 01 '19

Sure, I understand that. But the gradient of the ReLU is bounded (it's piecewise constant) as well. And it's a simpler function, without the vanishing-gradient problem of the sigmoid. I don't know whether it would actually perform any better, though.

catalys1 · Mar 01 '19

Note that `y = scale * x + shift` with `scale = tf.nn.relu(h[:, :, :, 1::2])` and `shift = h[:, :, :, ::2]`. I agree that dy/dh is bounded here due to the ReLU, but dy/dx = scale, which corresponds to the entries of the Jacobian matrix, is still unbounded. In that case the determinant of the Jacobian can be huge, implying a dramatic volume growth from x to y. This makes training unstable in my experiments. The idea of using the sigmoid is to bound dy/dx from above; in fact, since the sigmoid output is below 1, it only allows the volume to shrink. I think it sacrifices capacity for stability. However, I don't have any further intuition beyond these observations. I guess it is something to overcome.
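
To make that concrete, here is a rough sketch of one coupling step with the sigmoid version and its log-determinant (illustrative shapes and names, not the exact repo code):

```python
import tensorflow as tf

def affine_coupling_forward(x, h):
    # h is the coupling network's output; x is the half being transformed
    shift = h[:, :, :, ::2]
    scale = tf.nn.sigmoid(h[:, :, :, 1::2] + 2.)  # in (0, 1), so dy/dx = scale < 1
    y = scale * x + shift
    # log|det J| is the sum of log(scale) over the transformed dimensions;
    # with scale < 1 every term is negative, i.e. the volume can only shrink.
    logdet = tf.reduce_sum(tf.math.log(scale), axis=[1, 2, 3])
    return y, logdet
```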

wanglouis49 · Mar 02 '19

@wanglouis49 Thanks for your insight.

catalys1 · Mar 05 '19

Has anyone tried simply splitting the output of the coupling network along the channel dimension and using the first half as the scale and the second half as the translation parameter, like in RealNVP/NICE? Is there any reason why Glow uses the odd-indexed channels for scale and the even-indexed channels for translation?
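
In other words, the two conventions would look roughly like this (my own sketch, with `h` as the coupling network's output):

```python
import tensorflow as tf

def split_interleaved(h):
    # Glow-style: even-indexed channels -> shift, odd-indexed channels -> scale
    shift = h[:, :, :, ::2]
    scale = h[:, :, :, 1::2]
    return shift, scale

def split_contiguous(h):
    # RealNVP/NICE-style contiguous split along the channel axis
    scale, shift = tf.split(h, num_or_size_splits=2, axis=-1)
    return shift, scale
```

My guess is that, since both halves come from the last layer of the same network, the two conventions only differ by a fixed permutation of that layer's output channels, so they should be equally expressive, but I'd be curious whether there is more to it.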

SrinjaySarkar · Mar 05 '21