CLIP-ReID Confusion about adapting position embedding with resolution change in ViT backbone

Open · BorgDiven opened this issue 10 months ago · 1 comment

Hello Author,

Thank you for your work on CLIP-ReID. I'm confused about two things: how the position embeddings of the Vision Transformer (ViT) backbone should be adapted when the input resolution changes, and how to load the CLIP model weights correctly in that case.

The original CLIP model is trained at a fixed input resolution, and I understand that the position embeddings are tied to that resolution. When the input resolution changes, it's unclear to me how the position embeddings should be adapted.
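
To make the mismatch concrete, here is a minimal sketch of the arithmetic. It assumes a ViT with non-overlapping 16x16 patches and one class token, as in CLIP ViT-B/16 pretrained at 224x224. The 256x128 target resolution is only an illustrative choice, and `num_position_embeddings` is a hypothetical helper, not a function from CLIP or CLIP-ReID.

```python
def num_position_embeddings(height: int, width: int, patch_size: int = 16) -> int:
    """Count the position embeddings a ViT needs: one per patch plus one class token."""
    # Assumption: non-overlapping patches, i.e. stride equals patch size.
    grid_h = height // patch_size
    grid_w = width // patch_size
    return grid_h * grid_w + 1


pretrained = num_position_embeddings(224, 224)  # 14 * 14 + 1 = 197
target = num_position_embeddings(256, 128)      # 16 * 8 + 1 = 129
print(pretrained, target)
```

Under these assumptions the pretrained table has 197 rows while the new input needs 129, so the table cannot be copied as-is.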

When loading the CLIP weights with a modified input resolution, are there any special considerations or steps to ensure the weights are loaded correctly?
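
For reference, one common way to handle this in ViT models is to keep the class-token embedding unchanged and interpolate the patch position embeddings from the old grid to the new one before loading them. Whether CLIP-ReID does exactly this is not something I can confirm. The sketch below is framework-free so that the arithmetic is visible; in practice this would normally be done with a tensor library's interpolation routine. The function name `resize_position_embeddings`, the corner-aligned bilinear scheme, and the grid sizes are all assumptions made for illustration.

```python
from typing import List, Tuple

Vector = List[float]


def resize_position_embeddings(
    pos_embed: List[Vector],
    old_grid: Tuple[int, int],
    new_grid: Tuple[int, int],
) -> List[Vector]:
    """Bilinearly resize patch position embeddings; keep the class-token row.

    pos_embed holds 1 + old_h * old_w vectors; row 0 is the class token and
    the rest are patch embeddings in row-major order (an assumption).
    """
    old_h, old_w = old_grid
    new_h, new_w = new_grid
    assert len(pos_embed) == 1 + old_h * old_w, "grid does not match table size"
    cls_embed, patch_embed = pos_embed[0], pos_embed[1:]

    def at(row: int, col: int) -> Vector:
        return patch_embed[row * old_w + col]

    resized = [list(cls_embed)]  # class-token embedding is copied unchanged
    for i in range(new_h):
        # Map the new row index onto the old grid (corner-aligned).
        y = i * (old_h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, old_h - 1)
        wy = y - y0
        for j in range(new_w):
            x = j * (old_w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, old_w - 1)
            wx = x - x0
            # Blend horizontally on the two neighbouring rows, then vertically.
            top = [(1 - wx) * a + wx * b for a, b in zip(at(y0, x0), at(y0, x1))]
            bottom = [(1 - wx) * a + wx * b for a, b in zip(at(y1, x0), at(y1, x1))]
            resized.append([(1 - wy) * t + wy * b for t, b in zip(top, bottom)])
    return resized


# Illustrative use: a 14x14 grid (224x224 input) resized to 16x8 (256x128 input).
old_table = [[0.0] * 4 for _ in range(1 + 14 * 14)]
new_table = resize_position_embeddings(old_table, (14, 14), (16, 8))
print(len(new_table))  # 129
```

In a standard ViT only the position embedding table has a shape that depends on the patch grid, so under this approach the remaining weights could be loaded unchanged.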

I've gone through the documentation and issues but haven't found a clear explanation on this topic. Any guidance, documentation references, or examples would be greatly appreciated.

Thank you for your time and assistance.

Best regards

Apr 08 '24 03:04 BorgDiven
