CLIP
ValueError: operands could not be broadcast together with shapes (4,224,224) (3,)
How is normalize in image_utils.py supposed to work on 4-channel images?
File ~/tensorflow-metal/lib/python3.8/site-packages/transformers/image_utils.py:143, in ImageFeatureExtractionMixin.normalize(self, image, mean, std)
141 return (image - mean[:, None, None]) / std[:, None, None]
142 else:
--> 143 return (image - mean) / std
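The channels-first branch at line 141 is evidently not taken for a 4-channel image, so execution falls through to line 143, where a (4, 224, 224) array minus a (3,) mean cannot broadcast. A minimal numpy sketch of the failure (the mean/std values are CLIP's defaults):

import numpy as np

# RGBA image after resize/center-crop, channels-first: (4, 224, 224)
image = np.random.rand(4, 224, 224)

# CLIP's per-channel statistics cover only the 3 RGB channels
mean = np.array([0.48145466, 0.4578275, 0.40821073])
std = np.array([0.26862954, 0.26130258, 0.27577711])

try:
    (image - mean) / std  # trailing axes don't match: 224 vs. 3
except ValueError as e:
    print(e)  # operands could not be broadcast together with shapes (4,224,224) (3,)

# Once the image is 3-channel, the per-channel form broadcasts fine:
rgb = image[:3]
print(((rgb - mean[:, None, None]) / std[:, None, None]).shape)  # (3, 224, 224)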
Here's an example from:
https://github.com/MaartenGr/Concept
https://github.com/MaartenGr/Concept/issues/12
Several images in this collection have a 4-channel (RGBA) shape:
pix = numpy.array(image)
pix.shape
(960, 640, 4)
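A quick way to spot the offending files (a sketch, assuming the photos/ folder created by the download snippet below) is to scan for anything PIL doesn't report as plain RGB:

import glob
from PIL import Image

# Flag any image whose mode isn't 3-channel RGB (e.g. RGBA, P, LA)
for path in glob.glob('photos/*.jpg'):
    with Image.open(path) as im:
        if im.mode != 'RGB':
            print(path, im.mode, im.size)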
To download the 25k images:

import os
import glob
import zipfile
from tqdm import tqdm
from sentence_transformers import util

# 25k images from Unsplash
img_folder = 'photos/'
if not os.path.exists(img_folder) or len(os.listdir(img_folder)) == 0:
    os.makedirs(img_folder, exist_ok=True)

    photo_filename = 'unsplash-25k-photos.zip'
    if not os.path.exists(photo_filename):  # Download dataset if it does not exist
        util.http_get('http://sbert.net/datasets/' + photo_filename, photo_filename)

    # Extract all images
    with zipfile.ZipFile(photo_filename, 'r') as zf:
        for member in tqdm(zf.infolist(), desc='Extracting'):
            zf.extract(member, img_folder)

img_names = list(glob.glob('photos/*.jpg'))
Here's the code:
from sentence_transformers import SentenceTransformer, util
from PIL import Image
#Load CLIP model
model = SentenceTransformer('clip-ViT-B-32')
#Encode an image:
img_emb = model.encode(Image.open('.notebooks/photos/7Y0ZVBWCfNw.jpg'))
#Encode text descriptions
text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'A picture of London at night'])
#Compute cosine similarities
cos_scores = util.cos_sim(img_emb, text_emb)
print(cos_scores)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [6], in <cell line: 2>()
1 #Encode an image:
----> 2 img_emb = model.encode(Image.open('/Users/davidlaxer/Concept/notebooks/photos/kYobqI1URDg.jpg'))
4 #Encode text descriptions
5 text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'A picture of London at night'])
File ~/tensorflow-metal/lib/python3.8/site-packages/sentence_transformers/SentenceTransformer.py:153, in SentenceTransformer.encode(self, sentences, batch_size, show_progress_bar, output_value, convert_to_numpy, convert_to_tensor, device, normalize_embeddings)
151 for start_index in trange(0, len(sentences), batch_size, desc="Batches", disable=not show_progress_bar):
152 sentences_batch = sentences_sorted[start_index:start_index+batch_size]
--> 153 features = self.tokenize(sentences_batch)
154 features = batch_to_device(features, device)
156 with torch.no_grad():
File ~/tensorflow-metal/lib/python3.8/site-packages/sentence_transformers/SentenceTransformer.py:311, in SentenceTransformer.tokenize(self, texts)
307 def tokenize(self, texts: Union[List[str], List[Dict], List[Tuple[str, str]]]):
308 """
309 Tokenizes the texts
310 """
--> 311 return self._first_module().tokenize(texts)
File ~/tensorflow-metal/lib/python3.8/site-packages/sentence_transformers/models/CLIPModel.py:71, in CLIPModel.tokenize(self, texts)
68 if len(images) == 0:
69 images = None
---> 71 inputs = self.processor(text=texts_values, images=images, return_tensors="pt", padding=True)
72 inputs['image_text_info'] = image_text_info
73 return inputs
File ~/tensorflow-metal/lib/python3.8/site-packages/transformers/models/clip/processing_clip.py:148, in CLIPProcessor.__call__(self, text, images, return_tensors, **kwargs)
145 encoding = self.tokenizer(text, return_tensors=return_tensors, **kwargs)
147 if images is not None:
--> 148 image_features = self.feature_extractor(images, return_tensors=return_tensors, **kwargs)
150 if text is not None and images is not None:
151 encoding["pixel_values"] = image_features.pixel_values
File ~/tensorflow-metal/lib/python3.8/site-packages/transformers/models/clip/feature_extraction_clip.py:150, in CLIPFeatureExtractor.__call__(self, images, return_tensors, **kwargs)
148 images = [self.center_crop(image, self.crop_size) for image in images]
149 if self.do_normalize:
--> 150 images = [self.normalize(image=image, mean=self.image_mean, std=self.image_std) for image in images]
152 # return as BatchFeature
153 data = {"pixel_values": images}
File ~/tensorflow-metal/lib/python3.8/site-packages/transformers/models/clip/feature_extraction_clip.py:150, in <listcomp>(.0)
148 images = [self.center_crop(image, self.crop_size) for image in images]
149 if self.do_normalize:
--> 150 images = [self.normalize(image=image, mean=self.image_mean, std=self.image_std) for image in images]
152 # return as BatchFeature
153 data = {"pixel_values": images}
File ~/tensorflow-metal/lib/python3.8/site-packages/transformers/image_utils.py:143, in ImageFeatureExtractionMixin.normalize(self, image, mean, std)
141 return (image - mean[:, None, None]) / std[:, None, None]
142 else:
--> 143 return (image - mean) / std
ValueError: operands could not be broadcast together with shapes (4,224,224) (3,)
Packages:
import transformers
print(transformers.__version__)
4.11.3
import sentence_transformers
print(sentence_transformers.__version__)
2.1.0
import numpy
print(numpy.__version__)
1.20.3
pix = numpy.array(image)
pix.mean()
71.51163167317708
pix.std()
106.86995132579301
pix.shape
(960, 640, 4)
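The fourth channel is alpha. Slicing it off by hand gives the 3-channel array CLIP expects, though PIL's convert("RGB") (shown in the fix below) is the safer route since it also handles palette-mode images. A sketch, using the path from the traceback:

import numpy
from PIL import Image

image = Image.open('/Users/davidlaxer/Concept/notebooks/photos/kYobqI1URDg.jpg')
pix = numpy.array(image)
rgb = pix[..., :3]   # drop the alpha channel
print(rgb.shape)     # (960, 640, 3)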
I also encountered the same problem. Images with QR codes inevitably trigger it.
You could convert your image to RGB with PIL before encoding it. In your case, it would be:
#Encode an image:
img_emb = model.encode(Image.open('.notebooks/photos/7Y0ZVBWCfNw.jpg').convert("RGB"))
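The same fix scales to the whole collection; a sketch (the batch size and the [:100] subset are arbitrary):

from sentence_transformers import SentenceTransformer
from PIL import Image
import glob

model = SentenceTransformer('clip-ViT-B-32')

# Force every image to 3-channel RGB so CLIP's per-channel
# normalization always broadcasts
img_names = list(glob.glob('photos/*.jpg'))
images = [Image.open(name).convert('RGB') for name in img_names[:100]]
img_embs = model.encode(images, batch_size=32, show_progress_bar=True)
print(img_embs.shape)  # (100, 512) for clip-ViT-B-32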