NaN raised when using the model to fit the `celebA` dataset

gitlabspy opened this issue 5 years ago · 12 comments

The initial loss / log_prob is very large, around 1e24. I tried CUB200 as well; CUB200 starts with an initial loss of around 40000 and training goes well.
When I try to use your model in the `realworld` folder to fit celebA, it fails at the very beginning and raises NaN.
Both CUB200 and celebA are resized to the same shape (224, 224, 3), so why does it fail on celebA?
BTW, this tutorial is very good!

gitlabspy avatar Feb 21 '20 06:02 gitlabspy

Let me show you the differences from Glow, and the problems in Glow and in tfp.

  • In Glow, the Gaussianization of the factored-out latent has an issue; I suspect this is one reason for the NaN (but I have no proof).

    In RealNVP, Flow++, and my implementation: z_i, h_i = factor_out(h_{i-1}), with z_i ~ N(0, 1). In Glow: z_i ~ N(mu, sigma), with mu, sigma = convnet(h_i).

  • We need weight normalization, because this network is very sensitive (Glow has the same problem).

  • tfp's log-det Jacobian has shape [], but in the affine coupling layer the log-det Jacobian has shape [batch_size]; this may indicate that tfp has a critical problem in its loss formula (see the shape sketch after this list).
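
A minimal shape sketch of that last point, with made-up tensor sizes (none of these names come from the repo or from tfp's internals): a scalar log-det silently broadcasts against the per-sample prior log-prob and corrupts the loss.

```python
import tensorflow as tf

# Illustrative shapes only.
batch_size, h, w, c = 16, 8, 8, 3
log_s = tf.random.normal([batch_size, h, w, c])

# In an affine coupling layer the log-det Jacobian is a per-sample quantity:
log_det_per_sample = tf.reduce_sum(log_s, axis=[1, 2, 3])   # shape [batch_size]

# A bijector that reports a scalar log-det instead...
scalar_log_det = tf.reduce_sum(log_s)                        # shape []

# ...broadcasts against the prior's per-sample log_prob and corrupts the loss:
prior_log_prob = tf.random.normal([batch_size])              # stand-in for prior.log_prob(z)
wrong_loss = -(prior_log_prob + scalar_log_det)              # adds the whole batch's log-det to every sample
right_loss = -(prior_log_prob + log_det_per_sample)
```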

MokkeMeguru avatar Feb 21 '20 13:02 MokkeMeguru

I think weight normalization is good for preventing NaN in some experiments, but I don't know of any papers about this.

(By the way, I'm writing another TensorFlow normalization in TFGENZOO.)
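
One possible way to apply weight normalization to the convolutions inside a coupling net; this is a generic sketch (not necessarily what TFGENZOO or this repo does) and assumes the tensorflow-addons package is installed.

```python
import tensorflow as tf
import tensorflow_addons as tfa  # assumes tensorflow-addons is installed

# Weight-normalize the hidden convolutions of the coupling network so the
# scale/translation nets stay well-conditioned early in training.
def coupling_net(out_channels, width=512):
    return tf.keras.Sequential([
        tfa.layers.WeightNormalization(
            tf.keras.layers.Conv2D(width, 3, padding="same", activation="relu")),
        tfa.layers.WeightNormalization(
            tf.keras.layers.Conv2D(width, 1, padding="same", activation="relu")),
        # Glow's usual zero-initialized last conv; weight norm on the hidden
        # layers is the extra stabilizer discussed above.
        tf.keras.layers.Conv2D(out_channels, 3, padding="same",
                               kernel_initializer="zeros"),
    ])
```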

MokkeMeguru avatar Feb 21 '20 13:02 MokkeMeguru

ref https://github.com/tensorflow/probability/issues/576

MokkeMeguru avatar Feb 21 '20 13:02 MokkeMeguru

Regarding "his gaussianize for factor-out's latent", I don't quite get it. What is h_i? By z_i ~ N(mu, sigma), do you mean the latent variable's distribution (a Gaussian) has learnable parameters mu and sigma produced by a convnet? Or are you referring to the affine coupling layer?
Regarding the tfp problem, are you saying the log-det should be a [batch_size]-shaped vector (one value per sample) instead of a single value for the whole batch?
I think weight normalization is a good idea 😊, I will try it!

gitlabspy avatar Feb 21 '20 16:02 gitlabspy

  1. mu and sigma are trainable (computed by a network during training) in Glow; this is not the affine coupling layer (see the sketch below). ref:
     1. https://github.com/openai/glow/blob/master/model.py#L89 (this is the splitting)
     2. https://github.com/openai/glow/blob/master/model.py#L89 (its definition)
     3. https://github.com/openai/glow/blob/master/model.py#L576-L584 (get mu and sigma from z1)
     4. https://github.com/openai/glow/blob/master/model.py#L552 (gaussianize z2 with mu and sigma)

  2. Yes, I think so.
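
A rough sketch of that split + learned-prior step (function and variable names are illustrative, not the actual API of openai/glow or this repo), contrasted with the fixed N(0, 1) prior used in RealNVP/Flow++ and in this tutorial:

```python
import math
import tensorflow as tf

# Glow-style split: keep z1 in the flow, score the factored-out z2 under a
# Gaussian whose parameters are predicted from z1.
def split2d(h, prior_conv):
    """h: [B, H, W, C]; prior_conv maps z1's channels to twice z2's channels."""
    z1, z2 = tf.split(h, 2, axis=-1)
    params = prior_conv(z1)                        # usually a zero-initialized conv
    mu, log_sigma = tf.split(params, 2, axis=-1)
    # log N(z2; mu, sigma) summed over pixels/channels -> shape [B]
    log_prob_z2 = tf.reduce_sum(
        -0.5 * (math.log(2.0 * math.pi) + 2.0 * log_sigma
                + tf.square(z2 - mu) * tf.exp(-2.0 * log_sigma)),
        axis=[1, 2, 3])
    return z1, log_prob_z2

# RealNVP / Flow++ (and this tutorial): mu = 0 and log_sigma = 0 above, i.e. z2
# is scored under a fixed N(0, 1), so there is no extra conv net that can blow up.
```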

MokkeMeguru avatar Feb 25 '20 02:02 MokkeMeguru

Thanks, Mokke. I totally understand your points now. I found something kind of interesting in the Glow source code: they were not using an affine coupling layer; they used additive coupling instead. I think it might have something to do with Actnorm. I noticed you used affine coupling in your implementation of Glow. Could there be a conflict between Actnorm and affine coupling that might cause the NaN?

gitlabspy avatar Mar 02 '20 14:03 gitlabspy

I think no. Certainly, the affine coupling layer has a multiply operation (additive coupling doesn't have one). But the multiply operation uses a scaled value. (https://github.com/MokkeMeguru/glow-realnvp-tutorial/blob/master/examples/models/affineCoupling.py#L55-L63)

(So I think "SCALED" is a very good method for avoiding NaN. E.g., I recommend using weight normalization. And someone says, "we should use normalization for inv1x1conv": https://github.com/openai/glow/issues/40#issuecomment-462103120)

MokkeMeguru avatar Mar 03 '20 00:03 MokkeMeguru

Thanks again for explaining all of this! There is one more thing I found in your code recently; for the sake of convenience I'll post it here, hope you don't mind. 😅 I found something confusing. In your implementation of Glow, in the affineCoupling layer, the Jacobian seems not quite right. https://github.com/MokkeMeguru/glow-realnvp-tutorial/blob/10461d7a0db9fb59e8b630668d2409ec7dcd43fa/realworld/layers/affineCoupling.py#L156 The Jacobian is sum(log|s|) in the original paper, as far as I understand; shouldn't it be tf.reduce_sum(tf.math.abs(log_s))?

gitlabspy avatar Mar 05 '20 12:03 gitlabspy

First, |log s| and log|s| are not the same. In the paper, the reason they use log|s| instead of log s is that log x is not defined for x <= 0.

That is, they allow s's domain to be ℝ, not ℝ⁺. But look at the function of the affine coupling layer in the paper:

y = exp(log s) * x + t

Here s = exp(log s), so s's domain is ℝ⁺, and we can use log_s directly instead of log|s|.

Q. Why do I compute log s instead of s in the NN layer? A. If the network predicted s directly and s underflowed toward 0, log s would blow up to -inf / NaN.
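
A tiny numeric illustration of that last point (made-up values, float32):

```python
import tensorflow as tf

# If the network predicted s directly, a tiny s can underflow to 0.0 in float32,
# and the log-det term log(s) becomes -inf, which then poisons the loss with NaN.
s = tf.constant([1e-46], dtype=tf.float32)       # underflows to 0.0
print(tf.math.log(s).numpy())                    # [-inf]

# Predicting log s directly keeps the log-det finite for the same tiny scale.
log_s = tf.constant([-106.0], dtype=tf.float32)  # exp(-106) is roughly that same tiny s
print(tf.reduce_sum(log_s).numpy())              # -106.0, a finite log-det term
```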

MokkeMeguru avatar Mar 05 '20 22:03 MokkeMeguru

How do I save the model in RealNVP? I tried flow.save(), but it raises "object has no attribute 'save'".

yangyijune avatar Sep 08 '20 15:09 yangyijune

I think you tried to save a tf.keras.layers.Layer. Please wrap your layer in a tf.keras.Model. ref. https://github.com/MokkeMeguru/glow-realnvp-tutorial/blob/master/realworld/model.py#L116
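
A minimal sketch of that wrapping (`RealNVPLayer` is a stand-in name for your own layer class, not something defined in this repo):

```python
import tensorflow as tf

# Wrap a flow layer (a tf.keras.layers.Layer) in a tf.keras.Model so that
# save()/save_weights() become available.
class FlowModel(tf.keras.Model):
    def __init__(self, flow_layer, **kwargs):
        super().__init__(**kwargs)
        self.flow = flow_layer

    def call(self, x, training=False):
        return self.flow(x, training=training)

# model = FlowModel(RealNVPLayer(...))
# model(tf.zeros([1, 224, 224, 3]))   # build the variables once
# model.save_weights("flow_ckpt")     # or model.save("saved_model_dir")
```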

MokkeMeguru avatar Sep 08 '20 16:09 MokkeMeguru

@MokkeMeguru Hey, this issue made me curious what a tfp version of Glow looks like when generating celebA... so I tinkered with your code a bit, adding variational dequantization and making the neural nets a bit more complex, and trained it on celebA. The results from the first few epochs look quite bad. I sample the image directly with transformedDistribution.sample(). (image: celeba_sample) Do you think the result above is correct?

The NaN problem is caused by the data preprocessing... The data should be correctly mapped to logit space (the term might be wrong) and dequantized, just like you did in TFGENZOO.
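
A sketch of the usual preprocessing being described here, i.e. uniform dequantization plus a logit transform whose log-det is added to the objective; this is the general recipe, not TFGENZOO's exact code, and all names are illustrative.

```python
import tensorflow as tf

# Dequantize 8-bit pixels with uniform noise, rescale into an open interval,
# then map to logit space; the transform's per-sample log-det must be added to
# the flow objective. (The constant log(1/256) per dimension from the /256 step
# is often accounted for separately.)
def preprocess(x_uint8, alpha=0.05):
    x = tf.cast(x_uint8, tf.float32)
    x = (x + tf.random.uniform(tf.shape(x))) / 256.0       # dequantize -> (0, 1)
    y = alpha + (1.0 - 2.0 * alpha) * x                     # keep away from 0 and 1
    z = tf.math.log(y) - tf.math.log1p(-y)                  # logit space
    log_det = tf.reduce_sum(
        tf.math.log(1.0 - 2.0 * alpha) - tf.math.log(y) - tf.math.log1p(-y),
        axis=[1, 2, 3])                                     # per-sample log-det
    return z, log_det
```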

Your work in TFGENZOO is fantastic! Due to some dependency issues I'm sticking with tfp, though. But anyway, there is almost no open-source code for flow-based models in tfp, or even in tf2. 😆

So, can you help me out, please? Do you think the model is correct based on that sampled image? What might cause this, in your opinion? Thanks for your work again!!!!

gitlabspy avatar Sep 09 '20 15:09 gitlabspy