
Number of dims don't match in permute

Open · shantanu778 opened this issue 2 years ago · 3 comments

In my task, I want to calculate the similarity score of 1 image against 22 texts. If I pass the image and texts without batching, it works properly, but if I pass them as a batch, I get the following error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[76], line 14
     12 with torch.no_grad():
     13     image_features = model.encode_image(image_input)
---> 14     text_features = model.encode_text(text_inputs.unsqueeze(0))
     16 # Pick the top 5 most similar labels for the image
     17 image_features /= image_features.norm(dim=-1, keepdim=True)

File /media/shantu/Study/anaconda3/envs/thesis/lib/python3.8/site-packages/clip/model.py:347, in CLIP.encode_text(self, text)
    344 x = self.token_embedding(text).type(self.dtype)  # [batch_size, n_ctx, d_model]
    346 x = x + self.positional_embedding.type(self.dtype)
--> 347 x = x.permute(1, 0, 2)  # NLD -> LND
    348 x = self.transformer(x)
    349 x = x.permute(1, 0, 2)  # LND -> NLD

RuntimeError: number of dims don't match in permute

Input image size: (1, 3, 224, 224) (batch_size, n_channels, image_width, image_height)
Input text size: (1, 22, 77) (batch_size, n_texts, text_length)
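For reference, clip.tokenize already returns a batched 2-D tensor of token ids with shape (n_texts, 77), and encode_text expects exactly that: after token embedding the input becomes a 3-D tensor, which permute(1, 0, 2) can handle. Adding another batch dimension with unsqueeze(0) makes the embedded tensor 4-D, and the three-index permute fails with the error above. A minimal sketch of the workaround, assuming a standard clip.load setup and hypothetical label strings:

import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# hypothetical labels standing in for the 22 texts
texts = [f"a photo of label {i}" for i in range(22)]
text_inputs = clip.tokenize(texts).to(device)  # shape (22, 77): already batched

with torch.no_grad():
    # pass the 2-D tensor directly; no extra unsqueeze(0)
    text_features = model.encode_text(text_inputs)  # shape (22, embed_dim)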

shantanu778 · Feb 20 '23

I got the same issue! Did you manage to solve it?

Angtrim · Feb 24 '23

Apparently the model is built to take a "flattened" batch of classes (or at least that's how it seems to me). So, in order to classify a single image over multiple lists of classes, I've implemented the following script, which computes cosine similarities between the image and each class list and applies a softmax per list:

# A way to do batch evaluation on different batches of lists of classes.
# BEWARE: "batches of lists of classes" means something like [ ["red", "blue", "green"], ["tall", "short"] ],
# not something like ["red", "green", "blue", "yellow"].
# Returns a list of probabilities like [ [0.75, 0.15, 0.10], [0.5, 0.5] ].
import torch
import clip

def batch_evaluation(image, classes_batch):
    # clip_model, clip_preprocess and device are assumed to be defined globally
    global clip_preprocess
    global clip_model
    image = clip_preprocess(image).unsqueeze(0).to(device)

    # Flatten the lists of classes into a single 1-D list,
    # remembering how many classes each list contains
    all_classes = []
    classes_dimension = []
    for single_class_list in classes_batch:
        classes_dimension.append(len(single_class_list))
        all_classes.extend(single_class_list)

    # Tokenize everything into one flat batch of shape (n_classes_total, 77)
    text_inputs = clip.tokenize(all_classes).to(device)

    with torch.no_grad():
        image_features = clip_model.encode_image(image)
        text_features = clip_model.encode_text(text_inputs)

    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    probabilities = []
    offset = 0
    # Slice the flat text features back into one chunk per class list,
    # tracking a running offset, and softmax each chunk independently
    for dim in classes_dimension:
        text_sliced = text_features[offset:offset + dim]
        similarity = (100.0 * image_features @ text_sliced.T).softmax(dim=-1)
        probabilities.append(similarity)
        offset += dim
    return probabilities
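For example, with clip_model, clip_preprocess and device already set up, and a hypothetical input image:

from PIL import Image

img = Image.open("photo.jpg")  # hypothetical image path
probs = batch_evaluation(img, [["red", "blue", "green"], ["tall", "short"]])
# probs[0] has shape (1, 3) and probs[1] has shape (1, 2); each row sums to 1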

Angtrim · Feb 28 '23

Same issue here. Any solutions so far?

JakobWong · Nov 18 '23