
Questions about the attention map

HelenMao opened this issue 3 years ago · 2 comments

Hi, I am trying your model on the AFHQ dataset and find that it preserves the background of the source image very well. I think this is thanks to the attention map: when I visualize it, I find that it does learn the mask.
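For reference, this is roughly how I visualize it (a minimal sketch; I assume the translator's attention output `att` has shape [B, C, H, W] with values in [0, 1], and the function name is just illustrative):

```python
import torch
import torch.nn.functional as F
from torchvision.utils import save_image

def visualize_attention(att, image_size=(256, 256), path="att.png"):
    # `att`: translator attention output, assumed [B, C, H, W] in [0, 1].
    with torch.no_grad():
        # Collapse the channel dimension to get one spatial mask per sample.
        spatial = att.mean(dim=1, keepdim=True)            # [B, 1, H, W]
        # Upsample to the image resolution for side-by-side comparison.
        spatial = F.interpolate(spatial, size=image_size,
                                mode="bilinear", align_corners=False)
        save_image(spatial, path, normalize=True)
```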

However, when I copy the attention module into my own framework (this paper), it does not work at all and fails to learn the mask. The main differences between my setup and yours are the use of the mapping network and the absence of a KL/MMD-related loss between the random noise distribution and the reference encoder embedding distribution (I also tried directly replacing your generator, and it fails to learn the mask as well).
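For context, the alignment term that is missing in my framework would look roughly like the following (just a sketch of a generic MMD between the two distributions, not your exact objective; the names `x`, `y`, and the bandwidth `sigma` are my own placeholders):

```python
import torch

def mmd_rbf(x, y, sigma=1.0):
    # Biased MMD^2 estimate with an RBF kernel.
    # x: style codes mapped from random noise,  [N, D]
    # y: reference-encoder embeddings,          [N, D]
    def k(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()
```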

I am wondering whether you have some experience with your attention map design. Under what conditions do you think it can learn the mask? It would be really great if you could share some experience with me, thanks a lot!

Looking forward to your reply!

HelenMao avatar May 29 '21 08:05 HelenMao

I've also tried the AFHQ dataset and found that HiSD focuses only on manipulating the shape while maintaining the background and color, which will be presented in the camera-ready supplementary material.

I think there are some key points behind why HiSD succeeds in learning the mask without any extra objective: 1. a separate translator for each tag or semantic; 2. no diversification loss; and 3. applying the mask on the feature rather than the image (which means that both channel-wise and spatial-wise attention are important).
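To make point 3 concrete, a minimal sketch of "applying the mask on the feature" could look like this (a simplified illustration, not the actual HiSD translator; the module name and layer choices are assumptions):

```python
import torch
import torch.nn as nn

class MaskedBlend(nn.Module):
    # Sketch only: blend translated and input *features* with a full
    # [C, H, W] attention map (both channel-wise and spatial-wise).
    def __init__(self, channels):
        super().__init__()
        # Predict one attention value per channel and per spatial location.
        self.to_att = nn.Sequential(
            nn.Conv2d(channels * 2, channels, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, feat_in, feat_translated):
        # Attention conditioned on both the original and translated features.
        att = self.to_att(torch.cat([feat_in, feat_translated], dim=1))
        # Regions the attention leaves untouched keep the input feature.
        return att * feat_translated + (1.0 - att) * feat_in
```

Because the attention covers every channel and spatial location of the feature, untouched regions simply copy the input feature, which is what preserves the background without any extra regularization loss.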

In previous works, a regularization objective is always needed; I think the reason is that a spatial-wise-only mask is hard for the generator to learn.

imlixinyang avatar May 29 '21 14:05 imlixinyang

I think the per-tag translator may not be the deciding factor, since I only use one tag when running the AFHQ dataset.

The diversification loss may have some influence; I need to do more experiments.

I directly copied your generator (including both the translator and the decoder) into my own framework. I think your generator does use both a channel-wise and spatial-wise attention map. However, it still cannot learn the mask, so I think that may not be the main reason.

HelenMao avatar May 30 '21 13:05 HelenMao