CLIP4Clip
[PAD_TOKEN] is not used; padding just appends 0
Thanks to your code, I am learning every day. Thank you very much.
In every dataloader, the special tokens are initialized as below:
self.SPECIAL_TOKEN = {"CLS_TOKEN": "<|startoftext|>", "SEP_TOKEN": "<|endoftext|>",
                      "MASK_TOKEN": "[MASK]", "UNK_TOKEN": "[UNK]", "PAD_TOKEN": "[PAD]"}
However, I found that [MASK], [UNK], and [PAD] are never actually used in the code. The problem arises because padding just appends zeros, like below:
while len(input_ids) < self.max_words:
    input_ids.append(0)      # 0 is used as the pad id
    input_mask.append(0)
    segment_ids.append(0)
In the vocab, there is no id reserved for [PAD]; token id 0 is assigned to '!'.
vocab = {'!': 0, '"': 1, '#': 2, '$': 3, '%': 4, '&': 5, ... }
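So after the padding loop above, input_mask is the only record of which positions are padding; the appended 0s are indistinguishable from a real '!' token id. A minimal sketch (all ids except 0, 49406, and 49407 are made up):

input_ids = [49406, 320, 0, 49407]  # <|startoftext|>, some token, '!', <|endoftext|>
input_mask = [1, 1, 1, 1]
max_words = 6
while len(input_ids) < max_words:
    input_ids.append(0)   # same id as the real '!' at position 2
    input_mask.append(0)
print(input_ids)  # [49406, 320, 0, 49407, 0, 0]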
If a caption contains '!' and is shorter than max_words, the embeddings of the '!' token and of the 'pad' token will be exactly the same, because token embedding is a plain nn.Embedding lookup:
self.vocab_size = vocab_size
self.token_embedding = nn.Embedding(vocab_size, transformer_width)  # pure id -> vector lookup
self.positional_embedding = nn.Parameter(torch.empty(self.context_length, transformer_width))
self.ln_final = LayerNorm(transformer_width)
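Since nn.Embedding is a pure table lookup, id 0 maps to the same vector regardless of whether it came from a real '!' or from padding. A minimal sketch (49408 and 512 are the CLIP ViT-B/32 vocab size and transformer width):

import torch
import torch.nn as nn

token_embedding = nn.Embedding(49408, 512)
bang = token_embedding(torch.tensor([0]))  # '!' appearing in a caption
pad = token_embedding(torch.tensor([0]))   # a position padded with 0
print(torch.equal(bang, pad))              # True: exactly the same vector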
Example:
caption1 = 'The boy is crying ! ! ! [PAD] [PAD]'
caption2 = 'The boy is crying [PAD] [PAD] [PAD] [PAD] [PAD]'
I think there is no way to differentiate between the two captions.
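For concreteness, a sketch with hypothetical ids (only '!' -> 0 is taken from the vocab above; special tokens omitted, as in the example):

ids1 = [3, 4, 5, 6, 0, 0, 0] + [0, 0]  # "The boy is crying ! ! !" + 2 pads
ids2 = [3, 4, 5, 6] + [0, 0, 0, 0, 0]  # "The boy is crying" + 5 pads
print(ids1 == ids2)  # True: the padded id sequences are identical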