
Why is there no class token?

Open swarajnanda2021 opened this issue 9 months ago • 4 comments

Dear authors, thanks for this work! It is indeed a very efficient and simple architecture. If you could help me understand the following question, it would be immensely useful.

The paper does not explain why there is no class token. Is it because you want to mask only the image patches, and therefore do not want to introduce class (or even register) tokens?

(I'm currently adapting some of this code to my workflow, and in my case I am using a class token and four register tokens. The training is running, so we shall see whether that setup makes any sense.)

swarajnanda2021 · May 13 '24 02:05

There's a bit of back and forth in the literature in general between using the [CLS] token vs. the average pooled output (over the sequence dimension). Here's a discussion you might find useful: https://github.com/huggingface/transformers/issues/7540
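For readers unfamiliar with the two options, here is a minimal NumPy sketch of the difference. It is only an illustration, not code from this repository: the shapes, the random `tokens` array, and the assumption that the [CLS] token sits at sequence position 0 are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, num_patches, dim = 2, 4, 8

# Hypothetical encoder output: one [CLS] token at position 0,
# followed by `num_patches` patch tokens.
tokens = rng.normal(size=(batch, 1 + num_patches, dim))

# Option 1: use the [CLS] token as the image representation.
cls_repr = tokens[:, 0]                  # shape (batch, dim)

# Option 2: average-pool the patch tokens over the sequence dimension.
mean_repr = tokens[:, 1:].mean(axis=1)   # shape (batch, dim)
```

Without a [CLS] token, option 2 reduces to averaging over the whole sequence, with no slicing needed.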

Beyond the (minor) performance differences, I think your theory is correct: the masking logic gets a bit messier when you have to account for the first token in the sequence being the [CLS] token. It introduces a few indexing headaches that I'd guess FAIR wanted to avoid.
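A rough sketch of the kind of indexing headache I mean, in NumPy. This is not the repo's masking code; `keep_patches` and `num_prefix` are hypothetical names for illustration. Mask indices refer to positions in the patch grid, so once prefix tokens ([CLS], registers) are prepended, every index has to be shifted and the prefix tokens have to be carried along explicitly.

```python
import numpy as np

def keep_patches(tokens, patch_idx, num_prefix=0):
    """Keep the patches listed in `patch_idx` (indices into the patch grid).

    `num_prefix` is the number of non-patch tokens ([CLS], registers)
    at the front of the sequence; they are always kept.
    """
    prefix = np.arange(num_prefix)
    # Patch i lives at sequence position i + num_prefix.
    seq_idx = np.concatenate([prefix, np.asarray(patch_idx) + num_prefix])
    return tokens[:, seq_idx]

num_patches, dim = 6, 3
patches = np.arange(num_patches * dim, dtype=float).reshape(1, num_patches, dim)
visible = [1, 4]  # patches left visible by a (made-up) mask

# No prefix tokens: mask indices can be used directly.
no_cls = keep_patches(patches, visible)

# With a [CLS] token prepended, the same mask needs an offset of 1.
cls = np.full((1, 1, dim), -1.0)
with_cls = keep_patches(np.concatenate([cls, patches], axis=1), visible, num_prefix=1)
```

Forgetting the offset silently selects the wrong patches (and possibly drops the [CLS] token), which is the sort of bug a patch-only sequence avoids.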

spencermyoung513 · Jun 05 '24 17:06

I trained a ViT with a class token and register tokens, and it was a disastrous model, haha. Thanks for the response.

swarajnanda2021 · Jun 06 '24 03:06

@swarajnanda2021 Did you add register tokens to the ViT like in this paper, and were the results bad? I want to try register tokens, but the pretraining phase is very costly, so if you have experience with them, please share more details.

Spidartist · Jun 09 '24 13:06

I used register tokens and a class token, and the results were nothing worth pursuing. In my understanding, something like I-JEPA has in some sense already been implemented as a secondary loss in DINOv2, which combines local-global similarity matching with the iBOT patch loss.

swarajnanda2021 · Jun 11 '24 16:06