
Training Unstable

Open Lotayou opened this issue 6 years ago • 19 comments

Description

The training is highly unstable: nine out of ten times the reconstructed pictures just end up as random noise patterns that look almost the same for every input vector z.

Things I did

I ran the project several times with the same piece of code. The only thing I changed is the data loading mechanism: I created a huge 202599 x 64 x 64 x 3 np.ndarray object to hold all the cropped images in memory and save loading time.
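For reference, the loading change is essentially the following (a rough sketch; the paths, crop size, and helper calls here are illustrative, not my exact code):

import glob
import numpy as np
from PIL import Image

# Read every cropped CelebA image once and keep the whole dataset in memory
# as a single uint8 array of shape (202599, 64, 64, 3), roughly 2.3 GB.
paths = sorted(glob.glob('celeba_cropped_64/*.jpg'))
data = np.empty((len(paths), 64, 64, 3), dtype=np.uint8)
for i, p in enumerate(paths):
    data[i] = np.asarray(Image.open(p))  # images already cropped/resized to 64x64

# Batches are then just slices of this array (optionally shuffled),
# so no disk I/O happens inside the training loop.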

The parameters and network architecture were not changed in any way. The order of the dataset is not altered during creation, but I've also tried random shuffling before training, and the issue remains.

By the way, the real samples look perfectly fine, so I assume there's nothing wrong with the data preprocessing and loading.

Some Details

I chose to generate samples every 20 steps, and the problem seems to occur around step 140-160. At the beginning, all reconstructed images are just randomly tinted blocks with random noise patterns that look the same to me; it's around step 140 when some differences start to appear. But most of the time the observed differences just disappear and the samples roll back to random noise patterns with irregularly varying colors.

Environment
- Windows 10
- Python 3.6 in Anaconda
- Nvidia GTX 1060, 6 GB
- TensorFlow GPU version 1.3.0

Lotayou avatar Nov 07 '17 10:11 Lotayou

I've tried a bunch of different implementations of the VAE-GAN paper, and none of them seems to converge. The different implementations also use different optimization techniques (gradient clipping, sigmoid gradient for the generator-discriminator). I've personally started to think this paper doesn't work... or that it needs a ridiculous amount of training time.

lucabergamini avatar Nov 17 '17 18:11 lucabergamini

@lucabergamini It seems that turning down the learning_rate to 0.0001 helps. But it could still take several hours (depending on your hardware and I/O speed) before the results start to make sense... The network is indeed unstable, but still trainable, given enough time and patience.

I'm thinking about using some other loss function; the KL divergence could be the reason the training is unstable. As far as I can see from TensorBoard, the loss doesn't correlate very well with the visual quality of the reconstructed images...
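For context, the KL term I'm worried about is the usual closed-form KL between the encoder's diagonal Gaussian and a standard normal; a minimal TF 1.x sketch (the variable names are mine, not necessarily the ones in this repo):

import tensorflow as tf

# KL(q(z|x) || N(0, I)) for a diagonal Gaussian encoder, summed over latent
# dimensions; z_mean and z_log_sigma_sq are the encoder outputs.
def kl_divergence(z_mean, z_log_sigma_sq):
    return 0.5 * tf.reduce_sum(
        tf.square(z_mean) + tf.exp(z_log_sigma_sq) - z_log_sigma_sq - 1.0,
        axis=1)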

Lotayou avatar Nov 20 '17 09:11 Lotayou

And you didn't use any tricks (apart from the lower LR) such as LR compensation between the generator and the discriminator? Are the qualitative results really better than a plain VAE's, enough to justify such a huge training time? If you want to talk about it, feel free to send me a PM!

lucabergamini avatar Nov 20 '17 09:11 lucabergamini

[Images: test_00_0000_con (reconstruction) and test_00_0000_r (real training samples)]

Yes, I didn't use any regularization tricks except the gradient_clipping already in the original project. Enclosed are the training results after about 40 epochs, along with some of my findings:

  1. The overall reconstruction quality is satisfying, with sharper details than VAE results. (Actually I haven't run a VAE myself, but I have seen VAE reconstruction results from my colleagues.)
  2. Frontal faces are easier to train on than side-view faces, and their reconstruction results are generally more visually pleasant.
  3. It's weird, but a small bluish pattern appears in the top-left corner of many pictures... I'm not quite sure what causes this pattern or how to avoid it. Could it be the average of the background?

As previously stated, I think the loss function could play an important role in training. I agree with your idea that some Linear Regularization (if that's what LR means, I assume?) could help, and I'm working on that experiment now.

Feel free to contact me by email: [email protected].

Lotayou avatar Nov 21 '17 06:11 Lotayou

@lucabergamini Forgot to mention that the top one is the reconstruction result and the bottom one shows the real training samples.

Lotayou avatar Nov 21 '17 06:11 Lotayou

@lucabergamini How's your training getting along?

Lotayou avatar Nov 23 '17 13:11 Lotayou

@Lotayou I wrote to you a few days ago; didn't you receive it?

lucabergamini avatar Nov 23 '17 13:11 lucabergamini

@lucabergamini Oh, I found your email in my spam folder... Sorry about that :)

Lotayou avatar Nov 24 '17 06:11 Lotayou

Sorry. You can set the learning_rate to 0.0001. I will update the code.

zhangqianhui avatar Nov 25 '17 12:11 zhangqianhui

I will figure out this problem and update the code for stabler training.

zhangqianhui avatar Nov 26 '17 10:11 zhangqianhui

I have updated the code for tf1.4 and more stable training.

zhangqianhui avatar Nov 27 '17 01:11 zhangqianhui

@Lotayou @lucabergamini Have you solved this? I met the same problem: the training crashed at about 100-120 epochs. Although I tried some tricks, such as input normalization, the LeakyReLU function, and dropout layers in the generator according to here, all of these just delayed the crash to 260-280 epochs; it still occurred in the end. It seems the crash is related to the encoder, since I see a sudden increase in encode_loss when the crash happens. (From top to bottom in the picture: d_loss, e_loss, g_loss.)

[Image: loss2 (loss curves)]

By the way, even the best output of my generator is not as good as the results shown in the project or above. I'd appreciate any information about solving this problem or training an excellent VAE-GAN.
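For reference, the three tricks amount to something like this in TF 1.x (a sketch under my own naming; the layer sizes and the exact places I applied them are illustrative, not my actual code):

import tensorflow as tf

# 1. Normalize inputs to [-1, 1] (images assumed to be uint8 in [0, 255]).
def normalize(images):
    return tf.cast(images, tf.float32) / 127.5 - 1.0

# 2. LeakyReLU instead of plain ReLU in the discriminator/encoder conv blocks.
def disc_block(x, filters):
    h = tf.layers.conv2d(x, filters, 5, strides=2, padding='same')
    return tf.nn.leaky_relu(h, alpha=0.2)

# 3. Dropout layers in the generator, active only while training.
def gen_block(x, filters, is_training):
    h = tf.layers.conv2d_transpose(x, filters, 5, strides=2, padding='same')
    h = tf.nn.relu(h)
    return tf.layers.dropout(h, rate=0.5, training=is_training)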

CheungXu avatar Jan 24 '18 09:01 CheungXu

@CheungXu I just followed the original network architecture of this project; the only significant change I made was tuning the learning rate down to 0.0001. Personally, I guess it's more a matter of luck than skill to train the network correctly.

I've now moved on to other projects. I think @lucabergamini has an excellent PyTorch implementation here. If you are also familiar with PyTorch, I recommend working from his project.

Lotayou avatar Jan 24 '18 11:01 Lotayou

@Lotayou Oh, I forgot to mention that I had already changed the learning rate to 0.0001 in my experiment. I'm not as familiar with PyTorch as with TF, but I will consult the PyTorch implementation and may try it in the future. Thanks all the same!

CheungXu avatar Jan 24 '18 11:01 CheungXu

@zhangqianhui I downloaded the code and ran it again. After training for 600,000 steps, I saved the model. The reconstructed pictures become the same random noise.

IlSLY avatar Dec 02 '18 06:12 IlSLY

How were the results at 300,000 steps?

zhangqianhui avatar Dec 02 '18 11:12 zhangqianhui

@zhangqianhui In fact, the results I got started getting worse from around step 90,000, and they have been random noise since about step 99,800.

IlSLY avatar Dec 03 '18 01:12 IlSLY

@CheungXu I have exactly the same behaviour on my custom dataset. The training runs fine until suddenly the encoder crashes, and within one or two epochs the output becomes almost white noise. Did you find a solution?

EnricoBeltramo avatar Jun 06 '19 10:06 EnricoBeltramo

@EnricoBeltramo I made a series of changes at that time, but I don't really remember which one of them solved the problem. These are the most likely solutions:

1. Use a bigger dataset. (I used CelebA instead of LFW in later training.)

2. Use MSE loss instead of the NLLNormal loss in the model:

def NLLNormal(self, pred, target):
      c = -0.5 * tf.log(2 * np.pi)
      multiplier = 1.0 / (2.0 * 1)
      tmp = tf.square(pred - target)
      tmp *= -multiplier
      tmp += c
      return tmp

->

def NLLNormal2(self, pred, target):
      return -tf.reduce_sum(tf.square(pred - target))

3. Give LL_loss a higher weight in the G loss:

self.G_loss = self.G_fake_loss + self.G_tilde_loss - 1e-6*self.LL_loss

->

self.G_loss = self.G_fake_loss + self.G_tilde_loss - 1e-5*self.LL_loss

4. Just use an AE instead of a VAE, which means no 'z_mean', 'z_sigm' or 'KL_loss':

self.z_mean, self.z_sigm = self.Encode(self.images)
self.z_x = tf.add(self.z_mean, tf.sqrt(tf.exp(self.z_sigm))*self.ep)
self.x_tilde = self.generate(self.z_x, reuse=False)
...
self.kl_loss = self.KL_loss(self.z_mean, self.z_sigm)

->

#(Now 'z_x' is just a vector output of encoder CNN)
self.z_x = self.Encode(self.images)
self.x_tilde = self.generate(self.z_x, reuse=False)
...
self.kl_loss = 0

5. Don't use the scale factors in the G/D losses:

d_scale_factor = 0.25
g_scale_factor =  1 - 0.75/2

->

d_scale_factor = 0
g_scale_factor =  0

6. Use gradient clipping in the optimizer (see the sketch after this list).
7. Use a dropout layer in the generator when training.
8. You can find more GAN training tricks here.
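For item 6, a minimal TF 1.x sketch of gradient clipping in the optimizer (the optimizer choice, clip value, and names are illustrative, not necessarily what the repo uses):

import tensorflow as tf

# Clip each gradient to [-clip_value, clip_value] before applying it.
def clipped_train_op(loss, var_list, learning_rate=1e-4, clip_value=1.0):
    optimizer = tf.train.RMSPropOptimizer(learning_rate)
    grads_and_vars = optimizer.compute_gradients(loss, var_list=var_list)
    clipped = [(tf.clip_by_value(g, -clip_value, clip_value), v)
               for g, v in grads_and_vars if g is not None]
    return optimizer.apply_gradients(clipped)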

Hope these are helpful to you. Good Luck.

CheungXu avatar Jun 13 '19 15:06 CheungXu