CLIP4Clip
[PAD_TOKEN] is not used; padding just appends 0
Thanks to your code, I am learning every day. Thank you very much.
In every dataloader, the special tokens are initialized as below:
self.SPECIAL_TOKEN = {"CLS_TOKEN": "<|startoftext|>", "SEP_TOKEN": "<|endoftext|>",
                      "MASK_TOKEN": "[MASK]", "UNK_TOKEN": "[UNK]", "PAD_TOKEN": "[PAD]"}
However, I found that [MASK], [UNK], and [PAD] are never actually used in the code. The problem arises because padding just appends zeros, like below:
while len(input_ids) < self.max_words:
    input_ids.append(0)      # 0 is used as the pad id
    input_mask.append(0)
    segment_ids.append(0)
In the vocab, there is no id reserved for [PAD]; token id 0 is assigned to '!'.
vocab = {'!': 0, '"': 1, '#': 2, '$': 3, '%': 4, '&': 5, ... }
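So after the padding loop above, input_mask is the only record of which positions are padding; the appended 0s are indistinguishable from a real '!' token id. A minimal sketch (all ids except 0, 49406, and 49407 are made up):

input_ids = [49406, 320, 0, 49407]  # <|startoftext|>, some token, '!', <|endoftext|>
input_mask = [1, 1, 1, 1]
max_words = 6
while len(input_ids) < max_words:
    input_ids.append(0)   # same id as the real '!' at position 2
    input_mask.append(0)
print(input_ids)  # [49406, 320, 0, 49407, 0, 0]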
If a caption contains '!' and is shorter than max_words, the embeddings of the '!' token and of the 'pad' token will be exactly the same, because token embedding is a plain nn.Embedding lookup:
self.vocab_size = vocab_size
self.token_embedding = nn.Embedding(vocab_size, transformer_width)  # pure id -> vector lookup
self.positional_embedding = nn.Parameter(torch.empty(self.context_length, transformer_width))
self.ln_final = LayerNorm(transformer_width)
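Since nn.Embedding is a pure table lookup, id 0 maps to the same vector regardless of whether it came from a real '!' or from padding. A minimal sketch (49408 and 512 are the CLIP ViT-B/32 vocab size and transformer width):

import torch
import torch.nn as nn

token_embedding = nn.Embedding(49408, 512)
bang = token_embedding(torch.tensor([0]))  # '!' appearing in a caption
pad = token_embedding(torch.tensor([0]))   # a position padded with 0
print(torch.equal(bang, pad))              # True: exactly the same vector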
Example:
caption1 = 'The boy is crying ! ! ! [PAD] [PAD]'
caption2 = 'The boy is crying [PAD] [PAD] [PAD] [PAD] [PAD]'
I think there is no way to differentiate between the two captions.
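For concreteness, a sketch with hypothetical ids (only '!' -> 0 is taken from the vocab above; special tokens omitted, as in the example):

ids1 = [3, 4, 5, 6, 0, 0, 0] + [0, 0]  # "The boy is crying ! ! !" + 2 pads
ids2 = [3, 4, 5, 6] + [0, 0, 0, 0, 0]  # "The boy is crying" + 5 pads
print(ids1 == ids2)  # True: the padded id sequences are identical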