
OpenCLIP's tokenizer is slightly different from OpenAI's CLIP tokenizer.

vedantroy opened this issue 2 years ago • 2 comments

Small discrepancy noticed between the 2 tokenizers:

  • https://github.com/openai/CLIP uses special tokens in the form <|startoftext|>
  • this repository uses special tokens of the form <start_of_text>

Not a huge difference, but I figured I'd check whether there's a specific reason for it.
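
To make the comparison concrete, here's a quick sketch (assuming both the `clip` and `open_clip` packages are installed, and that both `SimpleTokenizer` classes expose an `encoder` dict mapping token string to id, which they do at the time of writing). The special-token strings differ, but the ids they map to line up:

    # compare the special tokens of the two tokenizers
    from clip.simple_tokenizer import SimpleTokenizer as OpenAITokenizer
    from open_clip.tokenizer import SimpleTokenizer as OpenCLIPTokenizer

    openai_tok = OpenAITokenizer()
    open_clip_tok = OpenCLIPTokenizer()

    print('<|startoftext|>' in openai_tok.encoder)      # True
    print('<start_of_text>' in open_clip_tok.encoder)   # True

    # the ids should match even though the strings differ
    print(openai_tok.encoder['<|startoftext|>'],
          open_clip_tok.encoder['<start_of_text>'])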

vedantroy · Feb 28 '23, 22:02

Following up on this: below is a slightly patched version of the tokenizer in this repo (I just made one of the module-level functions an instance method):

    # clip_tokenizer is my locally patched copy of open_clip's tokenizer
    # module; vocab_path points at the BPE vocab file
    tokenizer = clip_tokenizer.SimpleTokenizer(
        bpe_path=str(vocab_path)
    )
    # tokenize() is the module-level function moved onto the instance;
    # it returns a LongTensor of shape [1, context_length]
    output = tokenizer.tokenize("hello world", context_length=77)
    print(output.shape)  # torch.Size([1, 77])
    decoded = tokenizer.decode_tensor(output[0])
    print(decoded)
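
For reference, the unpatched upstream equivalent would use the module-level function (assuming a version of open_clip that exposes tokenize at the package level):

    import open_clip

    # upstream, tokenize is a module-level function rather than a method
    output = open_clip.tokenize("hello world", context_length=77)
    print(output.shape)  # torch.Size([1, 77])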

And the decoded output from the patched tokenizer is:

<start_of_text>hello world <end_of_text>!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

I'm assuming the exclamation points at the end are fine?

vedantroy · Feb 28 '23, 23:02

Hi @vedantroy, for the different special tokens I don't know if there is a specific reason. As for the exclamation marks, they appear because the tokenizer uses 0 as the padding index, but 0 is also the id assigned to the exclamation mark, so every pad position decodes as "!". Unfortunately, I think changing this would be a lot of work, since all existing models work like this.
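
Here is a minimal sketch of the overlap (assuming open_clip is installed; SimpleTokenizer keeps a decoder dict mapping id back to token string):

    import open_clip
    from open_clip.tokenizer import SimpleTokenizer

    tok = SimpleTokenizer()

    # id 0 in the BPE vocab is also the token for '!'
    print(tok.decoder[0])  # '!'

    # tokenize() pads each row to context_length with 0, so decoding a
    # padded row renders every pad position as an exclamation mark
    tokens = open_clip.tokenize(["hello world"])  # shape [1, 77]
    print(tok.decode(tokens[0].tolist()))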

gpucce · Mar 01 '23, 09:03