A simple question about Image Resolution of NaFlex version
Thanks for your awesome work!
While reading SigLIP2's image-processing code on GitHub, I noticed that images first pass through a function named `get_image_size_for_max_num_patches`, and this function can drastically reduce the original image resolution. For example, a 2200×1700 image from an arXiv scientific paper PDF gets downscaled to 288×224 after going through it. I don't know whether my understanding is wrong or this is simply what siglip2-so400m-16-NaFlex does, i.e. it reduces the resolution before processing. I know there is a `max_seq_length` limit of 256, but I wonder whether 288×224 is enough for document-level parsing. Isn't that too small, given that SigLIP2 is trained to handle downstream tasks such as document understanding? Or can I actually modify `max_seq_length` (or something else) to ensure sufficient resolution before the image goes through the vision encoder? Can you clear up my doubts?
You can increase `max_seq_length` to get higher-resolution images after preprocessing. The maximum sequence length the NaFlex models were trained on is 1024. If you use the model zero-shot, you might see a drop in zero-shot metrics when you go above that (but it might be worth a shot). If you train a new model on top (e.g. a VLM), you might still get good performance above length 1024. Alternatively, you can split your images into aspect-preserving tiles and process them separately (again, assuming you're training a model on top of the representation).
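To make the trade-off concrete, here is a minimal sketch of the kind of aspect-preserving downscale the preprocessor performs. This is a hypothetical helper, not the exact `get_image_size_for_max_num_patches` implementation from the Transformers repo: it scales the image so the total pixel area fits the patch budget, then snaps each side down to a multiple of the patch size (16 here).

```python
import math

def target_size_for_max_patches(width, height, patch_size=16, max_num_patches=256):
    """Sketch: shrink (width, height) preserving aspect ratio so that the
    resulting grid of patch_size x patch_size patches stays within
    max_num_patches. Not the exact HF implementation."""
    # Scale factor so total pixel area is about max_num_patches * patch_size^2.
    scale = math.sqrt(max_num_patches * patch_size**2 / (width * height))
    scale = min(scale, 1.0)  # this sketch only ever downscales
    # Snap each side down to a multiple of patch_size (at least one patch).
    new_w = max(patch_size, int(width * scale) // patch_size * patch_size)
    new_h = max(patch_size, int(height * scale) // patch_size * patch_size)
    return new_w, new_h

# 2200x1700 page with the default budget of 256 patches -> 288x224 (252 patches)
print(target_size_for_max_patches(2200, 1700, 16, 256))
# Raising the budget to 1024 patches -> 576x448 (1008 patches)
print(target_size_for_max_patches(2200, 1700, 16, 1024))
```

This shows why the 2200×1700 page collapses to roughly 288×224 under a 256-patch budget, and how raising the budget toward 1024 roughly doubles each side.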
Thanks for your reply, happy weekend~