CLIP
How can I reduce the dimension of the image features returned by encode_image?

```python
image = preprocess(Image.open(image_name)).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
write_image_tensor(item_id, image_features)
```

The type and shape of image_features are <class 'torch.Tensor'> and torch.Size([1, 512]). I'd like to reduce the dimension, e.g. to torch.Size([1, 128]) or torch.Size([1, 64]).
Many thanks.
Facing a similar issue. I need to increase the dimension to [1, 1024]. Did you find a way, @sdalinluo?
It is not easy to increase or decrease the dimension of the image embedding without fine-tuning CLIP again. What we can do is attach a projection layer on top of the CLIP text and vision encoders, and then fine-tune these models.
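For the original openai/clip usage in the question, a minimal sketch of that idea could look like the following. Note that the CLIPWithProjection wrapper, the 128-dimensional output, and the hard-coded 512 input width (the ViT-B/32 embedding size) are illustrative assumptions, not code from this thread:

```python
import torch
import torch.nn as nn
import clip

class CLIPWithProjection(nn.Module):
    """Hypothetical wrapper: CLIP image encoder plus a trainable projection head."""
    def __init__(self, clip_model, out_dim=128):
        super().__init__()
        self.clip_model = clip_model
        # 512 is the image-embedding width of ViT-B/32; adjust for other checkpoints.
        self.image_proj = nn.Linear(512, out_dim)

    def encode_image(self, image):
        feats = self.clip_model.encode_image(image)   # [B, 512]
        return self.image_proj(feats.float())         # [B, out_dim]

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
wrapped = CLIPWithProjection(model, out_dim=128).to(device)
# The projection layer starts with random weights, so the 128-d outputs only become
# meaningful after fine-tuning (e.g. contrastively, together with a text-side head).
```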
I solved a similar problem using Hugging Face's implementation of CLIP. Below is the code you can try:
from transformers import (
CLIPTextConfig,
CLIPVisionConfig,
CLIPTextModelWithProjection,
CLIPVisionModelWithProjection
)
CLIP_CHECKPOINTS = ""openai/clip-vit-base-patch32""
PROJECTION_DIM = 512 # Replace with your desired dimensions
padding_max_length = 77 # default is 77 that clip uses,
textConfig = CLIPTextConfig.from_pretrained(CLIP_CHECKPOINTS)
textConfig.projection_dim = PROJECTION_DIM
textConfig.max_position_embeddings = padding_max_length
visionConfig = CLIPVisionConfig.from_pretrained(CLIP_CHECKPOINTS)
visionConfig.projection_dim = PROJECTION_DIM
# Using CLIP text model with a projection head on top
clipTextModel = CLIPTextModelWithProjection.from_pretrained(
pretrained_model_name_or_path=CLIP_CHECKPOINTS,
config=textConfig,
ignore_mismatched_sizes=True
)
# Using CLIP vision ViT model with a projection head on top
clipVisionModel = CLIPVisionModelWithProjection.from_pretrained(
pretrained_model_name_or_path=CLIP_CHECKPOINTS,
config=visionConfig,
ignore_mismatched_sizes=True
)
# Now train CLIP, so that the embeddings become meaningful..
``
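Once the models are fine-tuned, inference with the new embedding size could look roughly like this. This is a sketch that reuses CLIP_CHECKPOINTS, clipVisionModel, and PROJECTION_DIM from the snippet above and assumes image_name is a path to an image file:

```python
import torch
from PIL import Image
from transformers import CLIPProcessor

# Prepare the image the way CLIP expects (resize, crop, normalise).
processor = CLIPProcessor.from_pretrained(CLIP_CHECKPOINTS)
inputs = processor(images=Image.open(image_name), return_tensors="pt")

with torch.no_grad():
    # image_embeds comes from the projection head, so its width equals PROJECTION_DIM.
    image_embeds = clipVisionModel(**inputs).image_embeds

print(type(image_embeds), image_embeds.shape)  # e.g. torch.Size([1, 128]) if PROJECTION_DIM = 128
```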
That's great! It's really helpful to see how to set up the config when using CLIP.
Hi, can you help with how the models should be trained after applying the code above (as mentioned in the comment at the end of the snippet)?
In the end, I didn't use CLIP that way. But you could try setting the models to training mode with model.train() and treat them as a pretrained network to fine-tune in your own way. Hope that helps.
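As a rough illustration only (not from this thread), a CLIP-style symmetric contrastive fine-tuning loop over the projection models defined above might look like this. The dataloader, batch keys, learning rate, and fixed temperature are all assumptions you would adapt to your dataset:

```python
import torch
import torch.nn.functional as F

# Assumes `clipTextModel` and `clipVisionModel` from the snippet above, and a
# `dataloader` yielding tokenised text and preprocessed images (e.g. via CLIPProcessor).
optimizer = torch.optim.AdamW(
    list(clipTextModel.parameters()) + list(clipVisionModel.parameters()),
    lr=1e-5,
)

clipTextModel.train()
clipVisionModel.train()

for batch in dataloader:  # `dataloader` is assumed, not defined here
    text_embeds = clipTextModel(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
    ).text_embeds                              # [B, PROJECTION_DIM]
    image_embeds = clipVisionModel(
        pixel_values=batch["pixel_values"],
    ).image_embeds                             # [B, PROJECTION_DIM]

    # CLIP-style symmetric contrastive (InfoNCE) loss with a fixed temperature.
    text_embeds = F.normalize(text_embeds, dim=-1)
    image_embeds = F.normalize(image_embeds, dim=-1)
    logits = image_embeds @ text_embeds.t() / 0.07
    labels = torch.arange(logits.size(0), device=logits.device)
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```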
Thanks a lot. Will try that on my dataset.