CLIP
Input image size
I want to use a pretrained vision transformer from CLIP to extract features from images. My original image size is 1024x1024. What is the largest input image size supported by any pretrained CLIP version? Resizing the images to the usual 224x224 would lose information. Thanks!
CLIP's ViT-Large can take images of size 336x336 (the ViT-L/14@336px checkpoint). Some loss of information is inevitable when downscaling a 1024x1024 image to 336x336, but I think CLIP is robust enough. Make sure you use ViT-L, which has the smallest patch size (14).
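To make the resolution and patch-size trade-off concrete, here is a rough sketch of the arithmetic. The table of variants is my assumption, taken from the model names in the OpenAI CLIP release, so check it against the checkpoint you actually load. `patch_grid` and `downscale_factor` are hypothetical helper names, not part of any CLIP API.

```python
# Input resolution and patch size for the ViT variants of OpenAI CLIP.
# Assumed from the released model names; verify for your checkpoint.
VIT_VARIANTS = {
    "ViT-B/32": (224, 32),
    "ViT-B/16": (224, 16),
    "ViT-L/14": (224, 14),
    "ViT-L/14@336px": (336, 14),
}


def patch_grid(resolution, patch_size):
    """Return (patches per side, total patches) for a square input."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    side = resolution // patch_size
    return side, side * side


def downscale_factor(original, resolution):
    """Linear shrink factor when resizing a square image to `resolution`."""
    return original / resolution


if __name__ == "__main__":
    original = 1024  # side length of the source images in the question
    for name, (res, patch) in VIT_VARIANTS.items():
        side, total = patch_grid(res, patch)
        factor = downscale_factor(original, res)
        print(f"{name}: {res}x{res} input, {side}x{side} grid, "
              f"{total} patches, downscale {factor:.2f}x")
```

At 336x336 with patch size 14 the model sees a 24x24 grid (576 patches), compared with 16x16 (256 patches) at 224x224, so the 336px variant keeps noticeably more spatial detail even though the image is still shrunk by about 3x per side.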