[BUG] naflexvit_so400m_patch16_siglip has undocumented different default pos_embed_interp_mode of "bicubic" instead of "bilinear"
Updates
Per further discussion, the difference is intentional, but undocumented. It is a difference with the reference implementation from Google Big Vision.
Original Report
Fix location: https://github.com/huggingface/pytorch-image-models/blob/a7c5368ba0c8713dc1c9a98cc83bf46ddd02b0a0/timm/models/naflexvit.py#L1767
This causes the default to be "bicubic": https://github.com/huggingface/pytorch-image-models/blob/a7c5368ba0c8713dc1c9a98cc83bf46ddd02b0a0/timm/models/naflexvit.py#L90
Reference code showing "bilinear" interpolation: https://github.com/google-research/big_vision/blob/0127fb6b337ee2a27bf4e54dea79cff176527356/big_vision/models/proj/image_text/naflex_vit.py#L67
After making this change, TIMM is able to forward SigLIP2 NaFlex with cosine similarity above 0.9999 at each intermediate layer.
@redhottensors I'm aware of the difference; it can be changed in the config easily, but there are also differences between the torch bilinear and jax implementations ... how much worse is the similarity as it is right now in your comparisons? In practical terms, for zero-shot eval etc., it didn't seem to make much difference.
Cosine similarities of the intermediate hidden states on a test image (pre-resized to 384x384, with both models confirmed to be running at sequence length 576) are roughly 0.985-0.99, and the pooled outputs have cosine similarity 0.995. When using bilinear they are all 1.0.
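For reference, the comparison metric used above is plain cosine similarity over the flattened hidden states. A minimal stdlib sketch (the example vectors and helper name are illustrative, not taken from either codebase):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length flat vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical states give exactly 1.0; a small perturbation drops the score
# only slightly, which is why values like 0.985-0.99 still indicate two
# nearly (but not exactly) matching implementations.
hidden_a = [0.2, -1.1, 0.7, 3.0]
hidden_b = [0.21, -1.08, 0.69, 2.97]
print(round(cosine_similarity(hidden_a, hidden_a), 4))  # 1.0
print(cosine_similarity(hidden_a, hidden_b) > 0.999)    # True
```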
https://github.com/huggingface/pytorch-image-models/pull/2543#issuecomment-3053707369 "I evaluated both in zero-shot and bicubic appeared to 'win' across a few scenarios."
Fair enough. In that case, is this documented somewhere?
@redhottensors no, it is not documented. It was based on some evals I ran while verifying model correctness. I haven't finished full integration with OpenCLIP, but I will check again and report when I do some time this month. For now I'd like to leave it as is, absent additional evidence.
It is easy to override if you have concerns: `mm = timm.create_model('naflexvit_base_patch16_siglip', pos_embed_interp_mode='bilinear')` ... this can also be included as a model arg in a timm config for any model pushed to the hub, so it will be the default without specifying additional args on model creation (when there are weights to load).
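As a sketch of the hub-config option, the override could ride along in the checkpoint's config so `create_model` applies it by default. The field names below follow timm's hub config conventions but are an assumption here; the actual file for any given checkpoint may differ:

```json
{
  "architecture": "naflexvit_base_patch16_siglip",
  "model_args": {
    "pos_embed_interp_mode": "bilinear"
  }
}
```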
I'd recommend comparing the two options in your use case, not comparing against the decisions made in transformers.
Understood. I consider the fact that this difference is undocumented a bug and will not close. It is not a difference only with transformers, but also a difference vs the reference code from Google Big Vision.
@redhottensors I have no problem leaving this open, the reference code is less important than the results when you're switching frameworks. Interpolation is implemented differently.
I will change the issue title and summary to indicate that I now consider this a documentation bug.
Thank you @rwightman for your quick response and careful consideration.
@drhead @redhottensors no worries, also if you do find anything on your end, feel free to share here as I will take into account when I finalize and push some weights for the encoders and for OpenCLIP.