[BUG] naflexvit_so400m_patch16_siglip has undocumented different default pos_embed_interp_mode of "bicubic" instead of "bilinear"
Updates
Per further discussion, the difference is intentional, but undocumented. It is a difference with the reference implementation from Google Big Vision.
Original Report
Fix location: https://github.com/huggingface/pytorch-image-models/blob/a7c5368ba0c8713dc1c9a98cc83bf46ddd02b0a0/timm/models/naflexvit.py#L1767
This causes the default to be "bicubic": https://github.com/huggingface/pytorch-image-models/blob/a7c5368ba0c8713dc1c9a98cc83bf46ddd02b0a0/timm/models/naflexvit.py#L90
Reference code showing "bilinear" interpolation: https://github.com/google-research/big_vision/blob/0127fb6b337ee2a27bf4e54dea79cff176527356/big_vision/models/proj/image_text/naflex_vit.py#L67
After making this change, TIMM is able to forward SigLIP2 NaFlex with cosine similarity above 0.9999 at each intermediate layer.
@redhottensors I'm aware of the difference; it can be changed in the config easily, but there are also differences between the torch bilinear and jax implementations ... how much worse is the similarity as it is right now in your comparisons? In practical terms, for zero-shot eval etc., it didn't seem to make much difference.
Cosine similarities of the intermediate hidden states on a test image (pre-resized to 384x384, with both models confirmed to be running at sequence length 576) are roughly 0.985-0.99, and the pooled outputs have cosine similarity 0.995. When using bilinear they are all 1.0.
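For reference, the comparison metric used above is plain cosine similarity over the flattened hidden states. A minimal stdlib sketch (the example vectors and helper name are illustrative, not taken from either codebase):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length flat vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical states give exactly 1.0; a small perturbation drops the score
# only slightly, which is why values like 0.985-0.99 still indicate two
# nearly (but not exactly) matching implementations.
hidden_a = [0.2, -1.1, 0.7, 3.0]
hidden_b = [0.21, -1.08, 0.69, 2.97]
print(round(cosine_similarity(hidden_a, hidden_a), 4))  # 1.0
print(cosine_similarity(hidden_a, hidden_b) > 0.999)    # True
```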
https://github.com/huggingface/pytorch-image-models/pull/2543#issuecomment-3053707369 "I evaluated both in zero-shot and bicubic appeared to 'win' across a few scenarios."
Fair enough. In that case, is this documented somewhere?
@redhottensors no, it is not documented. It was based on some evals I ran while verifying model correctness. I haven't finished full integration with OpenCLIP, but I will check again and report when I do some time this month. For now I'd like to leave it as is, absent additional evidence.
It is easy to override if you have concerns: `mm = timm.create_model('naflexvit_base_patch16_siglip', pos_embed_interp_mode='bilinear')` ... this can also be included as a model arg in a timm config for any model pushed to the hub, so it will be the default without specifying additional args on model creation (when there are weights to load).
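As a sketch of the hub-config option, the override could ride along in the checkpoint's config so `create_model` applies it by default. The field names below follow timm's hub config conventions but are an assumption here; the actual file for any given checkpoint may differ:

```json
{
  "architecture": "naflexvit_base_patch16_siglip",
  "model_args": {
    "pos_embed_interp_mode": "bilinear"
  }
}
```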
I'd recommend comparing the two options in your use case, not comparing against the decisions made in transformers.
Understood. I consider the fact that this difference is undocumented a bug and will not close. It is not a difference only with transformers, but also a difference vs the reference code from Google Big Vision.
@redhottensors I have no problem leaving this open, the reference code is less important than the results when you're switching frameworks. Interpolation is implemented differently.
I will change the issue title and summary to indicate that I now consider this a documentation bug.
Thank you @rwightman for your quick response and careful consideration.
@drhead @redhottensors no worries, also if you do find anything on your end, feel free to share here as I will take into account when I finalize and push some weights for the encoders and for OpenCLIP.