RuntimeError: The size of tensor a (2) must match the size of tensor b (50) at non-singleton dimension 1
When I load the ViT-B/32 model, like so,
clip_image_and_text_model, preprocess = clip.load("ViT-B/32", device=device)
And then construct a CIFAR100 dataloader using preprocess as the transform, I have input tensors of shape (batch size, 3, 32, 32). But when I try calling
clip_image_and_text_model.encode_image(input_tensor)
I get RuntimeError: The size of tensor a (2) must match the size of tensor b (50) at non-singleton dimension 1. The error is caused by line 224 in model.py:
x = x + self.positional_embedding.to(x.dtype)
At this line, x has shape (batch size, 2, 768), whereas self.positional_embedding has shape (50, 768).
I don't know what the correct behavior is here: should x not have that extra dimension, should self.positional_embedding have a matching one, or is it something else entirely?
Could someone please clarify?
The comment on line 223 suggests that x should indeed be 3-dimensional, which makes me think the positional embedding is what's wrong. Is that correct?
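Here is a minimal sketch that reproduces the error with 32x32 batches (it uses a plain ToTensor() transform to keep the CIFAR100 images at their native resolution; the dataset root and batch size are just placeholders):

```python
import clip
import torch
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import CIFAR100

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_image_and_text_model, preprocess = clip.load("ViT-B/32", device=device)

# CIFAR100 images are 32x32; ToTensor() alone keeps them at that resolution
dataset = CIFAR100(root="./data", download=True, transform=transforms.ToTensor())
loader = DataLoader(dataset, batch_size=16)

images, _ = next(iter(loader))          # images: (16, 3, 32, 32)
with torch.no_grad():
    clip_image_and_text_model.encode_image(images.to(device))  # raises the RuntimeError above
```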
Hey @RylanSchaeffer , I am finding a similar problem when changing the input size of the model. Did you finally manage to find the proper solution?
@Evm7 unfortunately I can't remember :(
Facing exactly the same issue here
Found the solution. My problem was the size of the images: I had batches of dimension (16, 3, 32, 32) (16 images per batch, 3 channels, 32 height/width). It started working once I changed the transforms to resize the images to 224, so the final shape became (16, 3, 224, 224). Hope it helps!
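For context on why the shapes clash: ViT-B/32 uses 32x32 patches, so a 32x32 image yields a single patch plus the class token (sequence length 2), while the positional embedding was built for 224/32 = 7x7 patches plus the class token (length 50). A sketch of the fix (the simplest option is to reuse the preprocess returned by clip.load, since it already resizes/center-crops to 224 and applies CLIP's normalization; the dataset root and batch size are placeholders):

```python
import clip
import torch
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# preprocess resizes/center-crops to 224 and normalizes, so batches
# come out as (16, 3, 224, 224) instead of (16, 3, 32, 32)
dataset = CIFAR100(root="./data", download=True, transform=preprocess)
loader = DataLoader(dataset, batch_size=16)

images, _ = next(iter(loader))
with torch.no_grad():
    features = model.encode_image(images.to(device))   # (16, 512) for ViT-B/32
```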
In other words, the input shape must be (B, 3, 224, 224)?