What is the limit for text sequences?
I am trying to apply CLIP to a very specific dataset and need to fine-tune it. I am following the fine-tuning steps here: https://github.com/openai/CLIP/issues/83.
But I cannot figure out the maximum length of the text
sequence. Can it handle long texts, or do I have to restrict the text to a pre-defined limit?
Any help is appreciated.
According to CLIP's tokenizer function (clip.tokenize), it can encode sequences up to a context length of 77
(i.e., 77 tokens, not words) and raises a RuntimeError for anything longer.
The default context length is 77. To accept longer text sequences you need to enlarge the text encoder's position embeddings and then fine-tune, since the enlarged embeddings are newly initialized. Below is how you can do it.
from transformers import (
    CLIPTextConfig,
    CLIPVisionConfig,
    CLIPTextModelWithProjection,
    CLIPVisionModelWithProjection,
)

CLIP_CHECKPOINTS = "openai/clip-vit-base-patch32"
PROJECTION_DIM = 512        # Replace with your desired projection dimension
padding_max_length = 100    # Replace with your desired maximum position embeddings

textConfig = CLIPTextConfig.from_pretrained(CLIP_CHECKPOINTS)
textConfig.projection_dim = PROJECTION_DIM
textConfig.max_position_embeddings = padding_max_length

visionConfig = CLIPVisionConfig.from_pretrained(CLIP_CHECKPOINTS)
visionConfig.projection_dim = PROJECTION_DIM

# CLIP text model with a projection head on top.
# ignore_mismatched_sizes=True lets the checkpoint load even though the
# position-embedding table grew from 77 to padding_max_length rows; the
# enlarged table is newly initialized, which is why fine-tuning is needed.
clipTextModel = CLIPTextModelWithProjection.from_pretrained(
    pretrained_model_name_or_path=CLIP_CHECKPOINTS,
    config=textConfig,
    ignore_mismatched_sizes=True,
)

# CLIP vision (ViT) model with a projection head on top
clipVisionModel = CLIPVisionModelWithProjection.from_pretrained(
    pretrained_model_name_or_path=CLIP_CHECKPOINTS,
    config=visionConfig,
    ignore_mismatched_sizes=True,
)
This code snippet can help.
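For completeness, here is a sketch of how the resized text model could be used after the setup above. The checkpoint name and the value 100 mirror the snippet; the caption text is a made-up example. The key point is to tokenize with max_length equal to the enlarged limit rather than the default 77.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextConfig, CLIPTextModelWithProjection

CLIP_CHECKPOINTS = "openai/clip-vit-base-patch32"
padding_max_length = 100  # the enlarged position-embedding size from the setup

textConfig = CLIPTextConfig.from_pretrained(CLIP_CHECKPOINTS)
textConfig.max_position_embeddings = padding_max_length

clipTextModel = CLIPTextModelWithProjection.from_pretrained(
    CLIP_CHECKPOINTS, config=textConfig, ignore_mismatched_sizes=True
)

tokenizer = CLIPTokenizer.from_pretrained(CLIP_CHECKPOINTS)
# Pad/truncate to the enlarged limit instead of the default 77
inputs = tokenizer(
    "a long caption " * 20,
    padding="max_length",
    truncation=True,
    max_length=padding_max_length,
    return_tensors="pt",
)

with torch.no_grad():
    out = clipTextModel(**inputs)
print(out.text_embeds.shape)  # torch.Size([1, 512])
```

Note that because the enlarged position embeddings are freshly initialized, these outputs are only meaningful after fine-tuning on your dataset.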
Hello, I would like to ask where this code should be added. Is transformers a library? Looking forward to your response.
@chuyihuan If you are looking to fine-tune CLIP, the snippet can be useful; run it in your training script to construct the models before training. And yes, transformers is a library (Hugging Face Transformers).