A simple question about Image Resolution of NaFlex version
Thanks for your awesome work!
While reading SigLIP2's image-processing code on GitHub, I noticed that images first pass through a function named `get_image_size_for_max_num_patches`, and this function can drastically reduce the original image resolution. For example, a 2200×1700 image from an arXiv scientific paper PDF gets downscaled to 288×224 after going through it. I don't know whether my understanding is wrong or this is simply what siglip2-so400m-16-NaFlex does, i.e. it reduces the resolution before processing. I know there is a `max_seq_length` limit of 256, but I wonder whether 288×224 is enough for document-level parsing. Isn't that too small, given that SigLIP2 is trained to handle downstream tasks such as document understanding? Or can I actually modify `max_seq_length` (or something else) to ensure sufficient resolution before the image goes through the vision encoder? Can you clear up my doubts?
You can increase `max_seq_length` to get higher-resolution images after preprocessing. The maximum sequence length the NaFlex models were trained on is 1024. If you use the model zero-shot, you might see a drop in zero-shot metrics when you go above that (but it might be worth a shot). If you train a new model on top (e.g. a VLM), you might still get good performance above length 1024. Alternatively, you can split your images into aspect-preserving tiles and process them separately (again, assuming you're training a model on top of the representation).
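To make the trade-off concrete, here is a minimal sketch of the kind of aspect-preserving downscale the preprocessor performs. This is a hypothetical helper, not the exact `get_image_size_for_max_num_patches` implementation from the Transformers repo: it scales the image so the total pixel area fits the patch budget, then snaps each side down to a multiple of the patch size (16 here).

```python
import math

def target_size_for_max_patches(width, height, patch_size=16, max_num_patches=256):
    """Sketch: shrink (width, height) preserving aspect ratio so that the
    resulting grid of patch_size x patch_size patches stays within
    max_num_patches. Not the exact HF implementation."""
    # Scale factor so total pixel area is about max_num_patches * patch_size^2.
    scale = math.sqrt(max_num_patches * patch_size**2 / (width * height))
    scale = min(scale, 1.0)  # this sketch only ever downscales
    # Snap each side down to a multiple of patch_size (at least one patch).
    new_w = max(patch_size, int(width * scale) // patch_size * patch_size)
    new_h = max(patch_size, int(height * scale) // patch_size * patch_size)
    return new_w, new_h

# 2200x1700 page with the default budget of 256 patches -> 288x224 (252 patches)
print(target_size_for_max_patches(2200, 1700, 16, 256))
# Raising the budget to 1024 patches -> 576x448 (1008 patches)
print(target_size_for_max_patches(2200, 1700, 16, 1024))
```

This shows why the 2200×1700 page collapses to roughly 288×224 under a 256-patch budget, and how raising the budget toward 1024 roughly doubles each side.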
Thanks for your reply, happy weekend~