
Structural downsampling and static token sparsification

Yeez-lee opened this issue 3 years ago · 3 comments

Hi, this is quite solid and promising work, but I have some questions. (1) In the paper, you perform average pooling with kernel size 2 × 2 after the sixth block for the structural downsampling. But in Table 3, you show the results of both structural downsampling and static token sparsification. What is the difference between structural downsampling and static token sparsification, since their accuracies are not the same? (2) I'm interested in the average pooling with kernel size 2 × 2. Did you do extra experiments on the position of this structural downsampling, e.g., after the seventh or the tenth block of the ViT? (3) Could you provide the code for reproducing the results of structural downsampling and static token sparsification in Table 3, and the probability heat-map in Figure 6?

Thanks for your help!

Yeez-lee avatar Oct 29 '21 19:10 Yeez-lee

Hi, thanks for your interest in our work.

  1. "Structural downsampling" means that we downsample the token using 2x2 average pooling. "Static token sparsification" means that we learn a fixed parameter for each token to reflect its importance using our loss and learning method.

  2. We perform the average pooling after the sixth block since the resulting model will have similar FLOPs compared to our method. In this experiment, we fix the overall complexity of each model and compare the performance.

  3. You can implement the structural downsampling baseline by simply adding an average pooling layer after the sixth block. For the static token sparsification baseline, you can replace the output of the PredictorLG with an nn.Parameter tensor that is shared across all inputs (see the sketches below). We will update the code after the CVPR deadline.
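
For reference, here is a minimal sketch of the structural-downsampling baseline from point 3. It assumes a DeiT-S-style backbone with a CLS token and a 14×14 patch grid; the helper name `downsample_tokens` is illustrative, not from the released code.

```python
import torch
import torch.nn.functional as F

def downsample_tokens(x: torch.Tensor, grid: int = 14) -> torch.Tensor:
    """2x2 average-pool the patch tokens, keeping the CLS token intact.

    x: (B, 1 + grid*grid, C) token sequence taken after the sixth block.
    returns: (B, 1 + (grid // 2) ** 2, C)
    """
    cls_tok, patches = x[:, :1], x[:, 1:]           # split off the CLS token
    B, N, C = patches.shape
    patches = patches.transpose(1, 2).reshape(B, C, grid, grid)
    patches = F.avg_pool2d(patches, kernel_size=2)  # (B, C, grid/2, grid/2)
    patches = patches.flatten(2).transpose(1, 2)    # back to (B, N/4, C)
    return torch.cat([cls_tok, patches], dim=1)

# Illustrative usage inside the backbone's forward pass:
# for i, blk in enumerate(self.blocks):
#     x = blk(x)
#     if i == 5:                  # i.e., right after the sixth block
#         x = downsample_tokens(x)
```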
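And a corresponding sketch of the static-sparsification baseline: one learnable keep/drop logit pair per token position, shared across all inputs, standing in for the input-dependent PredictorLG output. `StaticPredictor` is an illustrative name, and its forward signature is simplified relative to the released PredictorLG.

```python
import torch
import torch.nn as nn

class StaticPredictor(nn.Module):
    """Input-independent stand-in for PredictorLG (illustrative name).

    One learnable (drop, keep) logit pair per token position, shared by
    all inputs; the rest of the pipeline (Gumbel-softmax sampling, ratio
    loss) can stay unchanged.
    """
    def __init__(self, num_patches: int = 196):
        super().__init__()
        self.score = nn.Parameter(torch.zeros(1, num_patches, 2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Ignore the input features; broadcast the static log-probabilities
        # across the batch so downstream code sees the usual (B, N, 2) shape.
        B = x.shape[0]
        return torch.log_softmax(self.score, dim=-1).expand(B, -1, -1)
```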

raoyongming avatar Oct 30 '21 06:10 raoyongming

Thanks for your quick response! Looking forward to seeing your official code for structural downsampling and static token sparsification after the CVPR deadline.

Yeez-lee avatar Oct 31 '21 05:10 Yeez-lee


Hello, do you have the code for generating the probability heat-maps in Figure 6? I want to reproduce the results of the paper, but I couldn't find the corresponding code. Looking forward to your reply, thank you.

Aoshika123 avatar Mar 01 '24 06:03 Aoshika123