ViT-Adapter: When I try to replicate the code, the results are not satisfactory.

I attempted to replicate the ablation experiments from Table 6 as described in the paper. When I added the SPM module to the ViT-S model, I obtained a result of 40.0, which is lower than the reported result of 41.6 in the paper. My approach involved resizing c1, c2, c3, and c4 to 1/16 resolution and then adding them to the patch embeddings in the model. Finally, the last block of the transformer was connected to the head. However, I couldn't find detailed descriptions of this specific ablation experiment in the paper. Could you provide more specific details about this part of the ablation experiment? Thank you very much.

Additionally, I noticed that the learning rate and fp settings differ between the 1x and 3x training configurations. Does the 1x ablation experiment for ViT-Small use the same configuration as the 'tiny' setting in the code (including learning rate, fp, etc.), with only the network size changed?

Thanks!

Jul 10 '23 03:07 tandangzuoren

Hi tandangzuoren, I need some time to clean up the code of the ablation experiments.

Jul 10 '23 14:07 czczup

I'm looking forward to your response. I have successfully applied the SPM module using the 3x strategy, but I couldn't achieve the reported result of 41.6 in the paper when using the 1x configuration. So I wanted to confirm with you whether my understanding of the SPM ablation method is correct or not.

Jul 11 '23 02:07 tandangzuoren

"by directly resizing and adding the spatial features from SPM, our variant 1 improves 1.4 APb and 0.9 APm over the baseline" Could you first clarify what the addition here refers to? Is it (a) SPM's c1, c2, c3, c4 resized and added to the output of the last ViT layer, or (b) SPM's c1, c2, c3, c4 resized and added to the 'x' from patch embed, which then passes through the ViT? The two are very different. Thanks!
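To make the two readings concrete, here is a minimal PyTorch sketch of what I mean. It is not the repository's code: the helper names (`resize_and_sum`, `variant_input`, `variant_output`), the toy embedding dimension, and the small MLP standing in for the ViT blocks are all my own assumptions, and I assume c1 to c4 are already projected to the same channel dimension as the patch tokens.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 8  # toy embedding dim, chosen only to keep the example small


def resize_and_sum(feats, size):
    # Bilinearly resize every SPM map to the 1/16 grid, then sum them.
    return sum(
        F.interpolate(f, size=size, mode="bilinear", align_corners=False)
        for f in feats
    )


def to_tokens(fmap):
    # (B, D, h, w) -> (B, h*w, D), matching the patch-token layout.
    return fmap.flatten(2).transpose(1, 2)


def variant_input(x, feats, blocks, size):
    # Reading (b): add the resized SPM features to the patch embeddings,
    # then run the sum through the transformer blocks.
    return blocks(x + to_tokens(resize_and_sum(feats, size)))


def variant_output(x, feats, blocks, size):
    # Reading (a): run the plain ViT, then add the resized SPM features
    # to the output of the last block.
    return blocks(x) + to_tokens(resize_and_sum(feats, size))


torch.manual_seed(0)
B, H, W = 1, 64, 64
size = (H // 16, W // 16)

# Hypothetical inputs: patch tokens on the 1/16 grid and four SPM maps
# at strides 4, 8, 16, 32, all with D channels (an assumption).
x = torch.randn(B, size[0] * size[1], D)
feats = [torch.randn(B, D, H // s, W // s) for s in (4, 8, 16, 32)]

# Stand-in for the ViT blocks; any nonlinear token-wise module shows the point.
blocks = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))

out_in = variant_input(x, feats, blocks, size)
out_out = variant_output(x, feats, blocks, size)
```

Both functions return tokens of the same shape, but because the blocks are nonlinear the two outputs are not the same, which is why I would like to know which one the paper used. My 40.0 result came from the `variant_input` style.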

Jul 11 '23 09:07 tandangzuoren