RuntimeError: The size of tensor a (2) must match the size of tensor b (50) at non-singleton dimension 1

RylanSchaeffer opened this issue 4 years ago • 6 comments

When I load the ViT-B/32 model, like so,

clip_image_and_text_model, preprocess = clip.load("ViT-B/32", device=device)

and then construct a CIFAR100 dataloader with preprocess as the transform, I get input tensors of shape (batch size, 3, 32, 32). But when I try calling

clip_image_and_text_model.encode_image(input_tensor)

I get RuntimeError: The size of tensor a (2) must match the size of tensor b (50) at non-singleton dimension 1. The error is caused by line 224 in model.py:

x = x + self.positional_embedding.to(x.dtype)

At this line, x has shape (batch size, 2, 768) whereas self.positional_embedding has shape (50, 768).

I don't know what the correct behavior should be: should x not have size 2 at dimension 1, should self.positional_embedding have a different size at dimension 1, or is something else wrong entirely?

Could someone please clarify?

RylanSchaeffer avatar Nov 28 '21 21:11 RylanSchaeffer

The comment on line 223 indicates that x should indeed be 3-dimensional, which suggests to me that the positional embedding is what's wrong. Is this correct?

RylanSchaeffer avatar Nov 28 '21 21:11 RylanSchaeffer
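The positional embedding itself is not malformed; the two mismatched sizes in the error are token counts. ViT-B/32 cuts the image into 32x32 patches and prepends a class token, so a 224x224 input yields 7 x 7 + 1 = 50 tokens (the length the positional embedding was built for), while a 32x32 input yields only 1 x 1 + 1 = 2. A small sketch of the arithmetic, assuming square inputs and patch size 32:

# Token count for a ViT with square patches plus one prepended class token.
def num_tokens(image_size: int, patch_size: int = 32) -> int:
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + 1  # +1 for the class token

print(num_tokens(32))   # 2  -> "tensor a" in the error
print(num_tokens(224))  # 50 -> "tensor b", the positional embedding length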

Hey @RylanSchaeffer, I am running into a similar problem when changing the input size of the model. Did you ever find a proper solution?

Evm7 avatar Jan 13 '22 16:01 Evm7

@Evm7 unfortunately I can't remember :(

RylanSchaeffer avatar Jan 13 '22 19:01 RylanSchaeffer

Facing exactly the same issue here

jogisuda avatar Jan 28 '22 16:01 jogisuda

Found the solution. My problem was the size of the images: I had batches of shape (16, 3, 32, 32) (16 images per batch, 3 channels, height/width of 32). It started working once I changed the transforms to resize to 224, so the final shape became (16, 3, 224, 224). Hope it helps!
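In code, the fix amounts to handing the dataset the preprocess transform returned by clip.load, which resizes, center-crops to 224x224, and normalizes each image before batching. A minimal sketch (the root path and batch size here are arbitrary):

import clip
import torch
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# preprocess resizes/crops to 224x224 and normalizes, so batches come out
# as (batch_size, 3, 224, 224) -- the shape encode_image expects.
dataset = CIFAR100(root="./data", download=True, transform=preprocess)
loader = DataLoader(dataset, batch_size=16)

images, _ = next(iter(loader))
with torch.no_grad():
    features = model.encode_image(images.to(device))
print(images.shape)    # torch.Size([16, 3, 224, 224])
print(features.shape)  # torch.Size([16, 512]) for ViT-B/32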

jogisuda avatar Jan 28 '22 17:01 jogisuda


In other words, the input shape must be (B, 3, 224, 224)?

Dinosaurcubs avatar Dec 25 '23 07:12 Dinosaurcubs
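For ViT-B/32, yes. Each CLIP variant declares its own expected square input resolution (for example, RN50x4 uses 288), and the preprocess transform returned by clip.load already matches it, so it is safer to read the resolution off the model than to hard-code 224. A quick check, assuming the OpenAI CLIP implementation where the vision tower exposes input_resolution:

import clip

model, preprocess = clip.load("ViT-B/32")
# The vision tower stores the square input resolution the weights expect.
print(model.visual.input_resolution)  # 224 for ViT-B/32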