Number of dims don't match in permute
In my task, I want to calculate the similarity score of 1 image against 22 texts. If I pass the images and texts without batching, it works properly, but if I pass them as a batch it raises the following error:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[76], line 14
12 with torch.no_grad():
13 image_features = model.encode_image(image_input)
---> 14 text_features = model.encode_text(text_inputs.unsqueeze(0))
16 # Pick the top 5 most similar labels for the image
17 image_features /= image_features.norm(dim=-1, keepdim=True)
File /media/shantu/Study/anaconda3/envs/thesis/lib/python3.8/site-packages/clip/model.py:347, in CLIP.encode_text(self, text)
344 x = self.token_embedding(text).type(self.dtype) # [batch_size, n_ctx, d_model]
346 x = x + self.positional_embedding.type(self.dtype)
--> 347 x = x.permute(1, 0, 2) # NLD -> LND
348 x = self.transformer(x)
349 x = x.permute(1, 0, 2) # LND -> NLD
RuntimeError: number of dims don't match in permute
Input image size: (1, 3, 224, 224) (batch_size, n_channels, image_width, image_height)
Input text size: (1, 22, 77) (batch_size, n_txt, text_length)
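For reference, this is roughly what the un-batched call that works looks like for me (a minimal sketch; the model variant, image path, and label strings are placeholders). encode_text seems to expect a 2-D (n_texts, 77) tensor, so the extra unsqueeze(0) is what makes permute(1, 0, 2) fail:

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # 22 candidate texts for a single image (placeholder labels)
    labels = [f"a photo of class {i}" for i in range(22)]

    image_input = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # (1, 3, 224, 224)
    text_inputs = clip.tokenize(labels).to(device)                               # (22, 77), already batched

    with torch.no_grad():
        image_features = model.encode_image(image_input)   # (1, 512)
        # Pass the 2-D (22, 77) tensor directly; unsqueeze(0) would make it
        # (1, 22, 77), token_embedding would then return a 4-D tensor, and
        # permute(1, 0, 2) inside encode_text fails with this error.
        text_features = model.encode_text(text_inputs)      # (22, 512)

    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)  # (1, 22)
    values, indices = similarity[0].topk(5)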
I got the same issue! Did you manage to solve it?
Apparently the model is built to take a "flattened" batch of classes (or at least that is how it looks to me). So, in order to classify a single image over multiple lists of classes, I've implemented the following script, which uses cosine similarity followed by a softmax:
    # A way to do batch evaluation on different batches of lists of classes.
    # BEWARE: "batches of lists of classes" means something like [["red", "blue", "green"], ["tall", "short"]],
    # not something like ["red", "green", "blue", "yellow"].
    # Returns a list of probabilities like [[0.75, 0.15, 0.10], [0.5, 0.5]].
    import torch
    import clip

    def batch_evaluation(image, classes_batch):
        global clip_preprocess
        global clip_model
        image = clip_preprocess(image).unsqueeze(0).to(device)
        # Build a flat 1-D list of class names
        all_classes = []
        # A list to remember the number of elements in every class list
        classes_dimension = []
        for single_class_list in classes_batch:
            classes_dimension.append(len(single_class_list))
            for class_item in single_class_list:
                all_classes.append(class_item)
        # Tokenize everything at once, so encode_text sees a 2-D (n_classes, 77) tensor
        text_inputs = clip.tokenize(all_classes).to(device)
        with torch.no_grad():
            image_features = clip_model.encode_image(image)
            text_features = clip_model.encode_text(text_inputs)
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        probabilities = []
        # Slice the flat text features back into the original class lists,
        # keeping a running offset so slices don't overlap
        offset = 0
        for dim in classes_dimension:
            text_sliced = text_features[offset:offset + dim]
            offset += dim
            similarity = (100.0 * image_features @ text_sliced.T).softmax(dim=-1)
            probabilities.append(similarity)
        return probabilities
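Called, for example, like this (a rough usage sketch; the image path and class lists are just placeholders):

    from PIL import Image

    img = Image.open("example.jpg")
    probs = batch_evaluation(img, [["red", "blue", "green"], ["tall", "short"]])
    # probs is a list of two tensors with shapes (1, 3) and (1, 2),
    # each row summing to 1 after the softmax
    for p in probs:
        print(p)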
Same issue here. Any solutions so far?