
code mismatch with the theory

Open basavaraj-hampiholi opened this issue 2 years ago • 10 comments

Hi All,

Thanks for providing the code. I came across a mismatch between the code and the theory you propose for the transformer block. The paper says "Instead, we propose to replace the original position-wise linear projection for Multi-Head Self-Attention (MHSA)", but lines 198-200 in https://github.com/leoxiaobin/CvT/blob/main/lib/models/cls_cvt.py still project q, k, v through linear layers. Have you missed an else statement there? Why are you projecting the q, k, v values twice?

Please correct me if I have misunderstood it.

Thanks, Basavaraj

basavaraj-hampiholi commented on Jul 29 '21
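
For reference, here is a minimal sketch of the pattern being asked about (assuming PyTorch and einops; this is not the repo's exact code): a depthwise convolutional projection produces the q tokens, and a position-wise linear projection is still applied afterwards, which is what lines 198-200 do for q, k, and v.

```python
import torch
import torch.nn as nn
from einops import rearrange

class ConvAttentionSketch(nn.Module):
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        # depthwise conv + batch norm, roughly what conv_proj_q/k/v do
        self.conv_proj_q = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2,
                      groups=dim, bias=False),
            nn.BatchNorm2d(dim),
        )
        # the additional position-wise linear projection the issue asks about
        self.proj_q = nn.Linear(dim, dim)

    def forward(self, x, h, w):
        # x: (batch, h*w, dim) token sequence
        x_2d = rearrange(x, 'b (h w) c -> b c h w', h=h, w=w)
        q = rearrange(self.conv_proj_q(x_2d), 'b c h w -> b (h w) c')
        q = self.proj_q(q)   # second, linear projection applied after the conv projection
        return q

# usage: a 14x14 grid of 64-dim tokens
x = torch.randn(2, 14 * 14, 64)
q = ConvAttentionSketch(64)(x, h=14, w=14)   # shape: (2, 196, 64)
```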

This is after conv_proj_q, conv_proj_k and conv_proj_v. But I'm not sure why the authors still use the pointwise projections after the conv projections.

askerlee commented on Aug 06 '21

@askerlee, I think it is part of depthwise separable convolutions. Depthwise convolutions followed by pointwise projections.

basavaraj-hampiholi commented on Aug 17 '21

I want to know how to call the get_cls_model function in cls_cvt.py.

diaodeyi commented on Sep 21 '21

Hi @diaodeyi, in the present code the get_cls_model function is called via registry.py. You can use build_model in build.py to build the model. Otherwise, you can remove the registry and call the get_cls_model function directly. Both ways should work.

Good luck.

basavaraj-hampiholi commented on Sep 21 '21
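
A hedged sketch of the two options above; the module paths follow the repo layout mentioned in this thread (lib/models/build.py and lib/models/cls_cvt.py), and the exact signatures are an assumption that may need adjusting for your checkout.

```python
# Option 1: let the builder/registry resolve the model from the config
from lib.models.build import build_model
model = build_model(config)          # config: the experiment config object (assumed)

# Option 2: bypass the registry and call the factory directly
from lib.models.cls_cvt import get_cls_model
model = get_cls_model(config)
```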

By the way, does self.proj = nn.Linear(dim_out, dim_out) mean the FFN is just a projection with the same dimension?

diaodeyi commented on Oct 03 '21

@diaodeyi It's the single linear layer (with the same in/out dimension) right after the attention calculation. The FFN in this code is class MLP (line 53).

basavaraj-hampiholi commented on Oct 06 '21
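
To make the distinction concrete, a minimal sketch (assuming PyTorch; not the repo's code): self.proj is the output projection applied right after attention, while the FFN is a separate two-layer MLP, which is what class MLP in cls_cvt.py implements.

```python
import torch.nn as nn

class AttentionSketch(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)   # same in/out dim, applied right after attention

    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return self.proj(out)

class MlpSketch(nn.Module):               # the actual FFN: expand, activate, project back
    def __init__(self, dim, hidden_ratio=4):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim * hidden_ratio)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(dim * hidden_ratio, dim)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))
```

In a standard transformer block these sit in two separate residual branches: attention (ending in self.proj) first, then the MLP.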

Thanks, there are so many linear projections that aren't mentioned in the paper.

diaodeyi commented on Oct 06 '21

@diaodeyi Yes. I think they left them out with the presumption that the reader already has a good understanding of the basic transformer architecture.

basavaraj-hampiholi commented on Oct 06 '21

> @askerlee, I think it is part of depthwise separable convolutions. Depthwise convolutions followed by pointwise projections.

No, I think the proj_q/k/v are exactly the things the paper does not mention.

diaodeyi commented on Mar 09 '22

> > @askerlee, I think it is part of depthwise separable convolutions. Depthwise convolutions followed by pointwise projections.
>
> No, I think the proj_q/k/v are exactly the things the paper does not mention.

Hi, a depthwise separable conv contains two parts: a depth-wise conv and a point-wise conv. The author implemented the point-wise conv via a linear layer, maybe because it's convenient for the ablation study. The only difference between them is the bias term.

Markin-Wang commented on Oct 01 '22
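
A quick numerical check of that last point (a sketch assuming PyTorch): a 1x1 point-wise conv and nn.Linear compute the same map once their weights are matched, so implementing the point-wise step as a linear layer mainly changes how the parameters (and the bias flag) are declared.

```python
import torch
import torch.nn as nn

dim_in, dim_out, n_tokens = 64, 64, 196
x = torch.randn(2, n_tokens, dim_in)               # (batch, tokens, channels)

linear = nn.Linear(dim_in, dim_out)
conv1x1 = nn.Conv2d(dim_in, dim_out, kernel_size=1)

# copy the linear weights into the 1x1 conv so the two layers are identical
with torch.no_grad():
    conv1x1.weight.copy_(linear.weight.view(dim_out, dim_in, 1, 1))
    conv1x1.bias.copy_(linear.bias)

y_linear = linear(x)                                       # (2, 196, 64)
y_conv = conv1x1(x.transpose(1, 2).unsqueeze(-1))          # (2, 64, 196, 1)
y_conv = y_conv.squeeze(-1).transpose(1, 2)                # back to (2, 196, 64)
print(torch.allclose(y_linear, y_conv, atol=1e-5))         # True
```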