
How to use the tokenizer?

bhosalems opened this issue 1 year ago · 1 comment

Thanks for releasing the model on Huggingface.

I wanted to use the text encoder. For that I need to tokenize the input, but how do I use the tokenizer? Can I get it from the CLIPProcessor?

from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("vinid/plip")
tokenizer = processor.tokenizer

But with this, the model_max_length is an insanely high value: 1000000000000000019884624838656. So I was wondering whether this is the correct usage.

CLIPTokenizerFast(name_or_path='vinid/plip', vocab_size=49408, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|startoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True), added_tokens_decoder={ 49406: AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True), 49407: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True), }
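If I'm reading the transformers source correctly, that huge number is just the library's VERY_LARGE_INTEGER sentinel, `int(1e30)`, which is used as the default `model_max_length` when the checkpoint's tokenizer config doesn't set one. The odd trailing digits come from 1e30 not being exactly representable as a float. A quick sketch below demonstrates this and shows one way to tokenize anyway by passing an explicit `max_length` (77 is CLIP's usual context length; the `from_pretrained` part is commented out since it needs network access and is an assumption on my end):

```python
# transformers falls back to VERY_LARGE_INTEGER = int(1e30) when the
# tokenizer config specifies no model_max_length. 1e30 is a float, so
# converting it to int yields the strange-looking value from the repr.
sentinel = int(1e30)
print(sentinel)  # 1000000000000000019884624838656

# Hedged usage sketch (assumes Hub access; max_length=77 is CLIP's
# standard text context length, not something the plip repo guarantees):
#
# from transformers import CLIPProcessor
#
# processor = CLIPProcessor.from_pretrained("vinid/plip")
# tokens = processor.tokenizer(
#     ["an example caption"],      # hypothetical input text
#     padding="max_length",
#     max_length=77,               # cap length explicitly
#     truncation=True,
#     return_tensors="pt",
# )
```

So the tokenizer itself seems usable; you just need to pass `max_length`/`truncation` yourself rather than relying on `model_max_length`.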

bhosalems avatar Nov 21 '23 04:11 bhosalems

Please provide a minimal reproducible code example so we can assist. Here is a good tutorial on how to prepare one: https://stackoverflow.com/help/minimal-reproducible-example

huangzhii avatar Mar 26 '24 01:03 huangzhii