Confusions about Masked Operations:
(1) Are masked operations necessary for both pixel and attention maps during both training and testing stages?
(2) According to Table 1 of the original paper, masked operations appear to be more important. What distinguishes the masked operations on attention maps from Attention Dropout operations? Will Attention Dropout achieve comparable performance?
Thank you for your interest in our work.
(1) Are masked operations necessary for both pixel and attention maps during both training and testing stages?
Training stage: input mask + attention mask
Testing stage: attention mask only
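This scheme can be sketched as follows. This is our own minimal NumPy illustration, not the paper's actual Swin-based implementation; `random_mask`, the toy `forward`, and the mask ratios are all hypothetical stand-ins.

```python
import numpy as np

def random_mask(x, ratio, rng):
    """Zero out roughly `ratio` of the entries of x (hypothetical helper)."""
    return x * (rng.random(x.shape) >= ratio)

def forward(img, training, rng, input_ratio=0.8, attn_ratio=0.75):
    """Toy forward pass showing where each mask is applied."""
    # Input (pixel) mask: training stage only.
    x = random_mask(img, input_ratio, rng) if training else img
    # Stand-in for the network's attention maps.
    attn = np.tanh(x)
    # Attention mask: applied in BOTH training and testing stages.
    return random_mask(attn, attn_ratio, rng)

rng = np.random.default_rng(0)
img = rng.random((4, 8))
train_out = forward(img, training=True, rng=rng)
test_out = forward(img, training=False, rng=rng)
```

The point of the sketch is only the asymmetry: the pixel mask disappears at test time, while the attention mask remains in both stages.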
(2) According to Table 1 of the original paper, masked operations appear to be more important. What distinguishes the masked operations on attention maps from Attention Dropout operations? Will Attention Dropout achieve comparable performance?
As we emphasized in our paper, the input mask is the most critical factor in improving generalization ability, while the attention mask merely addresses the inconsistency between training and testing inputs (the input images are masked during training, whereas complete images are fed in during testing). The lower performance when using only the input mask in Table 1 is precisely due to this inconsistency. Therefore, dropout cannot achieve comparable performance. We also confirmed this in our experiments: "dropout" in Figure 11 denotes the variant that uses dropout.
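One way to make the mechanical difference concrete is the following hedged sketch (our own illustration, not the paper's code): standard dropout rescales the kept activations and is disabled at test time, whereas the attention mask is a hard zeroing, without rescaling, that stays active during testing.

```python
import numpy as np

def attention_mask(attn, ratio, rng):
    """Hard masking: applied in both training and testing, no rescaling."""
    return attn * (rng.random(attn.shape) >= ratio)

def attention_dropout(attn, p, rng, training):
    """Standard dropout: identity at test time, rescaled by 1/(1-p) in training."""
    if not training:
        return attn
    return attn * (rng.random(attn.shape) >= p) / (1.0 - p)

rng = np.random.default_rng(1)
attn = rng.random((4, 4)) + 0.1               # strictly positive attention scores
masked = attention_mask(attn, 0.75, rng)      # zeros survive into test time
dropped = attention_dropout(attn, 0.75, rng, training=False)  # unchanged at test
```

Because dropout becomes the identity at test time, it cannot compensate for the train/test input inconsistency that the attention mask is designed to absorb.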
In fact, since the attention mask can cause information loss, we are currently exploring better methods to replace the attention mask operation, which is an important direction for future work.
Thank you for your detailed responses and for addressing my confusions. The idea of using masked operations to improve generalization ability is both interesting and practical.
However, I still have some new questions:
(a) the attention mask is merely used to address the inconsistency between training and testing input images (the input images are masked during training, while the complete images are input during testing).
Does this suggest that the proposed idea is mainly intended for application in transformer-based structures?
(b) During the testing stage, the framework adopts the attention mask as the default. I am wondering whether the masked operation is executed in each basic Swin Transformer block during testing. Additionally, would the random operation bring fluctuations in denoising performance under the 75% attention mask ratio (according to Table 2, the model performs best at 75%)?
(c) Although a large proportion of the area is masked out, the performance is still very good. It seems that much of the information is actually redundant, which I find to be an interesting discovery. What is your perspective on this?
We sincerely thank you for your attention and discussion of our work. Your insightful questions are truly appreciated and have contributed to a deeper understanding of our research.
(a) I don't think so: the key component, the input mask, can be applied to different network structures, including CNNs. Applying it to different architectures may require some modifications; however, we have not conducted experiments in this area yet. Applying this method to more scenarios is indeed one of our future research directions.
(b) There will be fluctuations, but the difference in quantitative values is minimal: we ran the model 50 times, and the PSNR standard deviation was only 0.00971. Moreover, averaging multiple results brings a further performance improvement, such as an increase of about 0.3 dB in PSNR.
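The averaging trick can be sketched as follows. This is a toy illustration with an invented `masked_forward`; in the real model the stochasticity comes from the random attention mask, not from this rescaled toy pass.

```python
import numpy as np

def masked_forward(img, ratio, rng):
    """Toy stochastic pass: random mask with rescaling so the mean is unbiased."""
    return img * (rng.random(img.shape) >= ratio) / (1.0 - ratio)

rng = np.random.default_rng(2)
img = rng.random((8, 8))

# Run the stochastic forward pass 50 times and average the outputs.
runs = np.stack([masked_forward(img, 0.75, rng) for _ in range(50)])
avg = runs.mean(axis=0)

# Averaging reduces the mask-induced variance, so the averaged output
# lies closer to the clean target than any single run.
single_err = np.abs(runs[0] - img).mean()
avg_err = np.abs(avg - img).mean()
```

The same variance-reduction argument explains both the small run-to-run PSNR fluctuation and the gain from averaging multiple results.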
(c) During testing, we apply a mask to the attention features, which causes some information loss. Despite this, our model still performs well on noise types outside the training set. This supports our assertion that the standard training approach overfits the noise distribution of the training set. Therefore, when tested on a diverse and complex set of noise types outside the training set, some of the network information used to overfit the training distribution becomes redundant, and discarding it does not cause significant performance loss. We believe this is very meaningful for the model's generalization performance and will be an interesting research direction.
Thanks!