A question about how DeiT's two [CLS] tokens are processed
Hi, sorry to bother you. The paper says the two special [CLS] tokens in DeiT are averaged into a single [CLS] token, but in the code I see that they are concatenated with the input. What am I missing?
cls_tokens = self.v.cls_token.expand(B, -1, -1)
dist_token = self.v.dist_token.expand(B, -1, -1)
x = torch.cat((cls_tokens, dist_token, x), dim=1)
Oh, I see it now. The two outputs are averaged later in the forward pass:
x = (x[:, 0] + x[:, 1]) / 2
Sorry to bother you, and thank you for your good work. I am a new master's student in the speech area, and I need to submit a dissertation before I can graduate. Thank you for helping me along the way, although I haven't submitted a dissertation yet, haha~
To use the DeiT initialization, we have to set up the two tokens in the same way as DeiT, but as you point out, we average their outputs in the forward pass.
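For readers following along, here is a minimal sketch of the two steps discussed above: both tokens are concatenated in front of the patch sequence, and their outputs are averaged afterwards. The class name `TwoTokenSketch`, the tiny dimensions, and the single `nn.TransformerEncoderLayer` standing in for the DeiT blocks are assumptions made for illustration, not the actual model code.

```python
import torch
import torch.nn as nn


class TwoTokenSketch(nn.Module):
    """Hypothetical toy model: prepend cls and dist tokens, then average their outputs."""

    def __init__(self, embed_dim=8, num_heads=2):
        super().__init__()
        # Two learnable tokens, kept separate so DeiT-style weights could map onto them.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Stand-in for the real transformer blocks (assumption for this sketch).
        self.block = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=16, batch_first=True
        )

    def forward(self, x):
        B = x.shape[0]
        cls_tokens = self.cls_token.expand(B, -1, -1)
        dist_token = self.dist_token.expand(B, -1, -1)
        # Input side: both tokens are concatenated in front of the patch sequence.
        x = torch.cat((cls_tokens, dist_token, x), dim=1)
        x = self.block(x)
        # Output side: the two token outputs are averaged into one vector per sample.
        return (x[:, 0] + x[:, 1]) / 2


model = TwoTokenSketch().eval()
patches = torch.randn(4, 10, 8)  # (batch, num_patches, embed_dim)
with torch.no_grad():
    out = model(patches)
print(out.shape)  # torch.Size([4, 8])
```

So the concatenation happens on the input side and the averaging on the output side; the two do not conflict.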
Good luck with your dissertation.
-Yuan