diffae
It looks like z-sem is not being trained
Hi, thank you for your excellent research! When performing inference through the autoencoder, I consistently obtain the same output regardless of the input image (the result depends only on x_T). I tried training both with my own data and with the FFHQ dataset, and the same phenomenon occurred in both cases. I think it might be related to the issue of the gradient of z-sem becoming zero that was raised by another person; since there was no response to that post, I decided to raise it again. (https://github.com/phizaz/diffae/issues/63) Thank you.
Can you show me the smallest working code?
Sure, here is the code
```python
import matplotlib.pyplot as plt

# semantic code (z_sem) and stochastic code (x_T) of the input image
cond = model.encode(input_image)
xT = model.encode_stochastic(input_image, cond, T=50)

# decode back to an image
pred = model.render(noise=xT, cond=cond, T=20)
pred = (pred + 1) / 2
pred = pred[0].permute(1, 2, 0).cpu().numpy()
plt.imsave('image.png', pred)
```
When I ran the code, the image was generated successfully, but the problem is that when I change input_image to a different image, the same result is produced.
I cannot reproduce your problem. Can you provide the whole notebook with the encoding results for both images?
All right, I'll show you the whole process in detail.
First, I used the model trained for 98 epochs with run_ffhq128.py. As you know, the file is divided into four parts, and I executed only the first part, so that only the autoencoder is trained.
```python
# first part of run_ffhq128.py: train only the autoencoder
gpus = [0, 1, 2, 3, 4, 5, 6, 7]
conf = ffhq128_autoenc_130M()
train(conf, gpus=gpus)
```
Second, I used images of people's faces found on Google as input.
To show the problem I was talking about, I conducted a total of four experiments:
- cond and x_T both come from image1 (1.png): image1 is reconstructed (saved as image1)
- cond and x_T both come from image2 (2.png): image2 is reconstructed (saved as image2)
- cond comes from image2 (2.png) and x_T comes from image1 (1.png): image1 is reconstructed, and the result is exactly the same as case 1 (saved as image3)
- cond comes from image1 (1.png) and x_T comes from image2 (2.png): image2 is reconstructed, and the result is exactly the same as case 2 (saved as image4)
According to the above results, cond has no effect on the result image at all; the result is affected only by x_T. This doesn't make sense, because according to the paper, z-sem (cond) should have more influence on the resulting image than x_T. I'll attach the inference code, the input images, and the result images that I used. I couldn't attach the model because of the file size limit. Thank you. attachment.zip
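For reference, here is a minimal sketch of those four cases, reusing the encode / encode_stochastic / render calls from the snippet earlier in the thread; the image loading, the 128×128 resize, and the [-1, 1] normalization are assumptions about the preprocessing (not taken from the attachment), and `model` is assumed to be the loaded autoencoder as above.

```python
import matplotlib.pyplot as plt
from PIL import Image
from torchvision import transforms

# assumed preprocessing: resize to the model resolution and scale to [-1, 1]
preprocess = transforms.Compose([
    transforms.Resize((128, 128)),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

def load(path, device='cuda'):
    return preprocess(Image.open(path).convert('RGB')).unsqueeze(0).to(device)

img1, img2 = load('1.png'), load('2.png')

# semantic codes (z_sem) and stochastic codes (x_T) for both images
cond1, cond2 = model.encode(img1), model.encode(img2)
xT1 = model.encode_stochastic(img1, cond1, T=50)
xT2 = model.encode_stochastic(img2, cond2, T=50)

# the four cases: (cond, x_T) taken from (1,1), (2,2), (2,1), (1,2)
cases = {
    'image1': (cond1, xT1),
    'image2': (cond2, xT2),
    'image3': (cond2, xT1),  # cond from image2, x_T from image1
    'image4': (cond1, xT2),  # cond from image1, x_T from image2
}
for name, (cond, xT) in cases.items():
    pred = model.render(noise=xT, cond=cond, T=20)
    out = ((pred + 1) / 2)[0].permute(1, 2, 0).cpu().numpy()
    plt.imsave(f'{name}.png', out.clip(0, 1))
```

If z_sem had any effect, image3 and image4 should differ visibly from image1 and image2; identical outputs indicate that all of the information is flowing through x_T.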
- Do you mean this problem happens to your model trained from scratch on your own dataset?
- After looking at your attachment, I think I see some artifacts to suggest that you are encoding images that "the model is not trained with". If you use the checkpoint provided by us, it will definitely not work with your images because they are not "aligned" (FFHQ images are aligned in a particular way!). If you use your own checkpoint, make sure that you don't assume some kind of alignment in your training dataset.
- That's right, and I get the same problem not only with the model trained on my own data, but also with the one trained on FFHQ data. In fact, the model used in the attachment is the one trained on FFHQ data (run_ffhq128.py, only the autoencoder part).
- Thank you for your advice, but I still think it is strange that the results of the model are not affected by the condition (z-sem). Even if unseen data is fed into the model, there is no reason why the condition should have no effect at all. As I said before, when I ran run_ffhq128.py I ran only the autoencoder part (the rest is commented out), so do you think that is related to this problem?
- The artifacts in Image1 and image2 shouldn't be there if the model is trained and used correctly.
- z_sem can only influence outputs that are NOT part of the noise (X_T).
- It is usually the case that when your training images/test images don't share the same properties (such as aligned in the same way), most information will be contained in X_T (because the semantic encoder has no clue how to encode the image). Then, even when you change z_sem, you won't see any meaningful change to the output because most information is kept in X_T in the first place.
- As a good exercise, you may also plot x_T to see what information is in there (a minimal sketch follows below).
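As a concrete version of that last suggestion, here is a minimal sketch that visualizes x_T, assuming xT was obtained with encode_stochastic as in the snippet above; the per-image min/max rescaling is only for display.

```python
import matplotlib.pyplot as plt

# xT has shape (1, 3, H, W); ideally it looks like near-pure Gaussian noise,
# plus whatever image structure the stochastic encoder had to keep because
# z_sem could not carry it
xT_vis = xT[0].permute(1, 2, 0).cpu().numpy()
xT_vis = (xT_vis - xT_vis.min()) / (xT_vis.max() - xT_vis.min())  # rescale for display
plt.imsave('xT.png', xT_vis)
```

If a clearly recognizable face shows up in x_T, that is a sign that most of the information is being kept in the stochastic code rather than in z_sem.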
Thank you again for your advice. Additionally, I confirmed that the values of some parameters related to the encoder are zero in the checkpoint I used. So my understanding is that if I preprocess my training data and test data to have the same properties (such as being aligned in the same way), it will help solve this problem, right?
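A quick way to check both observations (near-zero encoder weights and cond carrying no information) could look like the sketch below; the `model.ema_model.encoder` attribute path is a guess about where the semantic encoder lives in the loaded checkpoint and may need adjusting, and `img1` / `img2` are the preprocessed tensors from the earlier sketch.

```python
# the attribute path is an assumption; adjust it to match your checkpoint
encoder = model.ema_model.encoder

# 1) are any encoder parameters identically zero?
for name, p in encoder.named_parameters():
    if p.detach().abs().max().item() < 1e-8:
        print('all-zero parameter:', name, tuple(p.shape))

# 2) do two different images give meaningfully different cond vectors?
cond1, cond2 = model.encode(img1), model.encode(img2)
print('||cond1|| =', cond1.norm().item())
print('||cond2|| =', cond2.norm().item())
print('||cond1 - cond2|| =', (cond1 - cond2).norm().item())
# if these norms are ~0, z_sem is effectively unused and the problem is in
# training, not in the inference code
```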
What does "aligned" mean?