OpenCLIP's tokenizer is slightly different from OpenAI's CLIP tokenizer.
There is a small discrepancy between the two tokenizers:
- https://github.com/openai/CLIP uses special tokens of the form `<|startoftext|>`; this repository uses special tokens of the form `<start_of_text>`.

Not a huge difference, but I figured I'd make sure there's no particular reason for it.
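For reference, the two spellings can be mapped mechanically. A minimal sketch (the helper name is mine, and it assumes only the start/end tokens differ between the two repos):

```python
# Hypothetical helper: translate OpenAI-style special-token spellings
# into the <start_of_text>/<end_of_text> form used in this repository.
SPECIAL_TOKEN_MAP = {
    "<|startoftext|>": "<start_of_text>",
    "<|endoftext|>": "<end_of_text>",
}

def normalize_special_tokens(text: str) -> str:
    """Replace OpenAI CLIP special-token spellings with OpenCLIP's."""
    for openai_form, openclip_form in SPECIAL_TOKEN_MAP.items():
        text = text.replace(openai_form, openclip_form)
    return text
```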
Following up on this: the snippet below uses a slightly patched version of the tokenizer in this repo (I just made one of the methods an instance method instead of a module-level method):
```python
tokenizer = clip_tokenizer.SimpleTokenizer(bpe_path=str(vocab_path))
output = tokenizer.tokenize("hello world", context_length=77)
print(output.shape)

decoded = tokenizer.decode_tensor(output[0])
print(decoded)
```
And the decoded output is:
<start_of_text>hello world <end_of_text>!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
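For what it's worth, the trailing characters can be trimmed before decoding. A sketch, assuming the padding id is 0 and that plain token-id lists (e.g. from `output[0].tolist()`) can be passed to the decode method:

```python
def strip_padding(token_ids, pad_id=0):
    """Drop trailing padding ids so decoding doesn't emit one '!' per pad.

    token_ids: a list of ints (e.g. output[0].tolist()).
    Caveat: since pad_id 0 is also the id of "!", a genuine trailing "!"
    token would be stripped as well.
    """
    end = len(token_ids)
    while end > 0 and token_ids[end - 1] == pad_id:
        end -= 1
    return token_ids[:end]
```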
I'm assuming the exclamation points at the end are fine?
Hi @vedantroy, for the different special tokens I don't know if there is a specific reason. As for the exclamation marks: the tokenizer uses 0 as the padding index, but 0 is also the index assigned to the exclamation mark, so every padding position decodes as "!". Unfortunately, I think changing this would be a lot of work, as all existing models were trained with this convention.
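To illustrate the collision with a toy vocabulary (not the real BPE vocab, which is much larger but likewise assigns "!" to index 0): id 0 serves both as the "!" token and as padding, so decoding a padded sequence emits one "!" per pad position.

```python
# Toy vocab illustrating the id-0 collision between "!" and padding.
ID_TO_TOKEN = {0: "!", 1: "<start_of_text>", 2: "hello", 3: "world", 4: "<end_of_text>"}
PAD_ID = 0  # same id as "!"

def pad(ids, context_length):
    """Right-pad a token-id list to context_length with PAD_ID."""
    return ids + [PAD_ID] * (context_length - len(ids))

def decode(ids):
    """Naive decode: look up and concatenate each token string."""
    return "".join(ID_TO_TOKEN[i] for i in ids)

ids = pad([1, 2, 3, 4], context_length=8)
print(decode(ids))  # the four pad positions decode as "!!!!"
```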