OpenCLIP's tokenizer is slightly different from OpenAI's CLIP tokenizer.
There is a small discrepancy between the two tokenizers:
- https://github.com/openai/CLIP uses special tokens of the form `<|startoftext|>`; this repository uses special tokens of the form `<start_of_text>`.

Not a huge difference, but I figured I'd make sure there's no particular reason for it.
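For reference, the two spellings can be mapped mechanically. A minimal sketch (the helper name is mine, and it assumes only the start/end tokens differ between the two repos):

```python
# Hypothetical helper: translate OpenAI-style special-token spellings
# into the <start_of_text>/<end_of_text> form used in this repository.
SPECIAL_TOKEN_MAP = {
    "<|startoftext|>": "<start_of_text>",
    "<|endoftext|>": "<end_of_text>",
}

def normalize_special_tokens(text: str) -> str:
    """Replace OpenAI CLIP special-token spellings with OpenCLIP's."""
    for openai_form, openclip_form in SPECIAL_TOKEN_MAP.items():
        text = text.replace(openai_form, openclip_form)
    return text
```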
Following up on this: the snippet below uses a slightly patched version of the tokenizer in this repo (I just made one of the methods an instance method instead of a module-level method):
```python
tokenizer = clip_tokenizer.SimpleTokenizer(bpe_path=str(vocab_path))
output = tokenizer.tokenize("hello world", context_length=77)
print(output.shape)

decoded = tokenizer.decode_tensor(output[0])
print(decoded)
```
And the decoded output is:
<start_of_text>hello world <end_of_text>!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
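For what it's worth, the trailing characters can be trimmed before decoding. A sketch, assuming the padding id is 0 and that plain token-id lists (e.g. from `output[0].tolist()`) can be passed to the decode method:

```python
def strip_padding(token_ids, pad_id=0):
    """Drop trailing padding ids so decoding doesn't emit one '!' per pad.

    token_ids: a list of ints (e.g. output[0].tolist()).
    Caveat: since pad_id 0 is also the id of "!", a genuine trailing "!"
    token would be stripped as well.
    """
    end = len(token_ids)
    while end > 0 and token_ids[end - 1] == pad_id:
        end -= 1
    return token_ids[:end]
```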
I'm assuming the exclamation points at the end are fine?
Hi @vedantroy, for the different special tokens I don't know if there is a specific reason. As for the exclamation marks: the tokenizer uses 0 as the padding index, but 0 is also the index assigned to the exclamation mark, so every padding position decodes as "!". Unfortunately, I think changing this would be a lot of work, as all existing models were trained with this convention.
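To illustrate the collision with a toy vocabulary (not the real BPE vocab, which is much larger but likewise assigns "!" to index 0): id 0 serves both as the "!" token and as padding, so decoding a padded sequence emits one "!" per pad position.

```python
# Toy vocab illustrating the id-0 collision between "!" and padding.
ID_TO_TOKEN = {0: "!", 1: "<start_of_text>", 2: "hello", 3: "world", 4: "<end_of_text>"}
PAD_ID = 0  # same id as "!"

def pad(ids, context_length):
    """Right-pad a token-id list to context_length with PAD_ID."""
    return ids + [PAD_ID] * (context_length - len(ids))

def decode(ids):
    """Naive decode: look up and concatenate each token string."""
    return "".join(ID_TO_TOKEN[i] for i in ids)

ids = pad([1, 2, 3, 4], context_length=8)
print(decode(ids))  # the four pad positions decode as "!!!!"
```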