
Clarification of network training

Open lorenmt opened this issue 4 years ago • 7 comments

Hello,

Thanks for providing source-code of this amazing work.

I am trying to reproduce your work line by line, and found a lot of the implementation unclear. Hopefully you can help clarify the following questions:

  1. The function and module names in the code are very confusing. The region-adaptive normalization method is called SEAN in the paper, but ACE in this repo. I hope the author (if you have some time) could rename things to make the code more readable. This would significantly help readers catch up with the detailed idea, since the majority of the code base comes from SPADE.

  2. What's the intuition behind degrading the last ResBlk to a standard SPADE block rather than using SEAN across all layers?

  3. From the paper (and the code), the loss function is exactly the same as in SPADE and pix2pix. I am wondering how the reconstruction signal pushes the generator to reproduce the input data. Just from the perceptual loss and the feature matching loss?

  4. So the entire training only does reconstruction, in other words, GAN training using the style code extracted from the same training image? Only after it fully converges do we re-shuffle style codes from different images to perform style transfer. Am I understanding this correctly? So we shouldn't reshuffle the style codes during training?

Thanks very much for the help.

Best,

lorenmt avatar Apr 28 '20 00:04 lorenmt

Thank you for the questions.

  1. Sorry for the confusing code. Actually I should spend some time refactoring it, and I plan to do so before May. The silly name "ACE" was borrowed from playing cards, just like "SPADE" ... and I can't change it now because the trained weights are stored under the name "ACE" ( ̄. ̄)

  2. It's not from intuition. The experiments show that if the SEAN block is added to the last ResBlk, the performance drops a lot. As the image resolution increases, the de-normalization parameters within the same semantic region become almost identical. I guess that is the main reason.

  3. Yes. For reconstruction, the network relies only on these two losses.

  4. Yes, your understanding is right. I wanted to reshuffle the style codes during training, but training already takes too much time, so I didn't. If reshuffling is used, the style transfer performance will definitely improve. In the current version, if the two images are too far apart, it doesn't perform well.

ZPdesu avatar Apr 29 '20 02:04 ZPdesu

Thanks, that is very helpful.

Here are some follow-up questions.

  1. Well... that's interesting. But you can load the weights, rename the keys, and save them again. You don't have to retrain the network from scratch as long as you keep the network structure the same (a minimal sketch is below, after this list).

  2. Could you elaborate a bit on what you mean by de-normalization parameters? Do you mean that beta^s and gamma^s for the same semantic map become the same, so it is easier to just parameterize them with a SPADE block?

  3. For computing the FID score, how many fake samples did you generate? And when generating the fake samples, did you just do reconstruction (each image with its own style code) or did you shuffle the style codes as well? I think just performing reconstruction would be consistent with the SPADE and pix2pix baselines; however, doing so would fail to evaluate the quality of style transfer... So how did you evaluate the FID score?

  4. I have tried to train SEAN by reshuffling the style codes. To be specific: I did reconstruction as in the paper, and then added another pass that reshuffles the styles (just reversing the batch index for simplicity, so the i-th segmentation map is paired with the style code at index -i) and updates the generator with only the standard GAN loss. (I assume the perceptual loss and the GAN feature matching loss only make sense for reconstruction.) But with this setup the generation quality is really poor, significantly worse than the original setting trained only on reconstruction. I am just wondering whether I did this correctly, or whether reshuffling the style codes simply takes much longer to converge.
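For point 1, a minimal sketch of what I mean (PyTorch; the checkpoint paths and the `ACE` → `SEAN` substitution are hypothetical examples, assuming the checkpoint is a plain state dict):

```python
import torch

# Hypothetical checkpoint path; the repo's actual file names may differ.
old_path = 'checkpoints/pretrained/latest_net_G.pth'

# Load the saved weights as a plain state dict on the CPU.
state_dict = torch.load(old_path, map_location='cpu')

# Rename the keys, e.g. replace the old "ACE" module name with "SEAN".
renamed = {k.replace('ACE', 'SEAN'): v for k, v in state_dict.items()}

# Save the renamed weights; the tensors themselves are untouched, so they still
# load into the refactored network as long as the module structure is unchanged.
torch.save(renamed, 'checkpoints/pretrained/latest_net_G_renamed.pth')
```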

Thanks again for the help.

lorenmt avatar Apr 29 '20 12:04 lorenmt

Thanks for the comments.

  1. Thank you for your suggestion; it solves the problem. Actually I still need to refactor the code to support other datasets and increase the flexibility of the network.

  2. As you said, the de-normalization parameters are beta^s and gamma^s. The problem with the SEAN block is that beta^s and gamma^s computed from different style codes are quite diverse, and they strongly affect the final results. On the other hand, beta^s and gamma^s computed by the SPADE block are much smoother, since most of the segmentation masks are similar. We empirically found it is better to use only SPADE in the last residual block; adding SEAN there causes obvious boundary problems and weird repeated patterns.

  3. For computing the FID score, we still compare the reconstruction results. Specifically, we compute the FID between the real images from the training set (28,000) and the reconstructed test images (2,000). You are right that this does not really evaluate the style transfer results. Traditional style transfer tasks (e.g. gender transfer) use an auxiliary classifier to help with evaluation, but we are working on a multimodal task, so it's tough to do a fair comparison.

  4. Some components are interrelated (e.g. 1. neck, nose, face; 2. eyes; 3. upper lip, lower lip, ...), so a complete per-region reshuffling would be very difficult. But I guess you only tried whole-image style reshuffling, which should be relatively easier. Maybe you can do a multi-step optimization, e.g. train only on reconstruction for the first 50 epochs, and introduce the reshuffling for the next 50 epochs (a rough sketch of what I mean is below). If it works, please leave me a message.
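Just to make the schedule in point 4 concrete, a rough sketch (hypothetical names, not code from this repo):

```python
import torch

def maybe_reshuffle_styles(style_codes, epoch, warmup_epochs=50):
    """Reconstruction-only for the first `warmup_epochs`, then start reshuffling:
    permute the batch so each segmentation map is paired with another image's styles."""
    if epoch < warmup_epochs:
        return style_codes
    perm = torch.randperm(style_codes.size(0), device=style_codes.device)
    return style_codes[perm]
```

Inside a training loop, the generator would then receive `maybe_reshuffle_styles(style_codes, epoch)` instead of the original style codes.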

Thank you!

ZPdesu avatar Apr 29 '20 16:04 ZPdesu

Hello again,

Thanks for your clarification.

For Q4, I think I have found an elegant solution: for any random permutation of the batch indices (to reshuffle the styles at the image level), we apply the same permutation to the targets of the VGG perceptual loss. That way, the network is encouraged to learn the transferred style correctly. The rest of the losses are left unchanged.

This is the code for further clarification: [screenshot]
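In rough pseudocode, the idea is something like this (made-up names, not the exact code from the screenshot; `style_encoder`, `generator`, and `vgg_loss` stand for the usual SEAN components):

```python
import torch

def reshuffled_step(generator, style_encoder, vgg_loss, seg_maps, images):
    # Random permutation of the batch: sample i borrows its style from image perm[i].
    perm = torch.randperm(images.size(0), device=images.device)

    # Per-region style codes extracted from the permuted (style-source) images.
    style_codes = style_encoder(images[perm], seg_maps[perm])

    # Generate with the original segmentation maps but the borrowed styles.
    fake = generator(seg_maps, style_codes)

    # Apply the SAME permutation to the VGG perceptual-loss targets, so the loss
    # encourages the output to pick up the style of the image it was borrowed from.
    loss_vgg = vgg_loss(fake, images[perm])
    return fake, loss_vgg
```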

The network has not fully converged yet, but it seems to work fine at the moment.

Could you confirm whether this setting would work in your original model, and in particular whether it would further improve the FID score?

Also, could you elaborate on how you did the mixture of styles in your original method? I want to keep that as a reference.

Thanks.

lorenmt avatar Apr 30 '20 16:04 lorenmt

Update: We need to swap the index for fake_pred as well.

Further, swapping an undefined style code onto a segmentation map whose region is missing causes instability in training. The strategy is to keep the undefined entries as they are and only swap a region's style code when it is defined in both images (sketch below).
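A rough sketch of that masked swap (hypothetical names; it assumes the per-region style codes are stored as an (N, num_regions, D) tensor and that each image has a boolean mask of which regions actually appear in it):

```python
import torch

def swap_styles_where_defined(style_codes, region_present, perm):
    """style_codes:    (N, num_regions, D) per-region style codes
    region_present: (N, num_regions) bool, True where the region exists in the image
    perm:           (N,) index of the style-source image for each sample

    A region's style is swapped only when it is defined in BOTH the target and the
    source image; otherwise the original code is kept, so the generator never
    receives an undefined style code."""
    candidate = style_codes[perm]
    both_defined = region_present & region_present[perm]   # (N, num_regions)
    return torch.where(both_defined.unsqueeze(-1), candidate, style_codes)
```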

lorenmt avatar Apr 30 '20 19:04 lorenmt

Thank you for your update.

Adding resampling when training the perceptual loss and feature matching loss sounds like a good idea. FUNIT used a similar approach (a feature matching loss) to match the styles of class images. The difference is that their classifier has category labels, which enforce that the generated image belongs to the category of the given class image, while ours is a binary true/false classifier. Therefore, I am not sure whether your solution is effective, or which layer is most suitable for the loss matching. When I am free, I will try your setting and some related experiments.

Also, do you still use the reconstruction loss, or is everything resampled? For the FID comparison on the reconstruction task, I guess your solution will not help, but I think it can improve the performance on style transfer.

ZPdesu avatar Apr 30 '20 21:04 ZPdesu

For the mixture of styles, are you talking about style interpolation or style crossover? Style interpolation is just a simple linear operation on the style codes, and style crossover substitutes the style codes in some ResBlks (rough sketch below).
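Just to keep it as a reference, a minimal sketch of both operations (hypothetical shapes and names, not the repo's actual API):

```python
def interpolate_styles(style_a, style_b, alpha=0.5):
    """Style interpolation: a simple linear blend of two style-code tensors."""
    return alpha * style_a + (1.0 - alpha) * style_b

def crossover_styles(styles_a, styles_b, blocks_from_b):
    """Style crossover: per-ResBlk style codes come from A, except for the blocks
    listed in `blocks_from_b`, whose codes are substituted from B.
    `styles_a` / `styles_b` are lists with one style-code tensor per ResBlk."""
    return [styles_b[i] if i in blocks_from_b else styles_a[i]
            for i in range(len(styles_a))]
```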

ZPdesu avatar Apr 30 '20 22:04 ZPdesu