Finetune on downstream tasks

Open dprokhorov17 opened this issue 8 months ago • 1 comment

Hello,

How can this model be further fine-tuned for downstream tasks such as object localization?

Apr 21 '25 14:04 dprokhorov17

Good question.
For object localization tasks, we suggest directly outputting the bounding box coordinates.
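
To make "directly outputting the bounding box coordinates" concrete, here is a minimal sketch of how a box could be turned into a plain-text training target and parsed back from generated text. The `<box>...</box>` tag format and the helper names `scale_box`, `box_to_text`, and `text_to_box` are hypothetical and are not taken from this repository. The sketch assumes coordinates are expressed in pixels of the image as it is actually fed to the model.

```python
import re

def scale_box(box, orig_size, input_size):
    """Map an (x1, y1, x2, y2) pixel box from the original image
    to the resized image that is actually fed to the model.
    Sizes are (width, height) tuples."""
    (orig_w, orig_h), (in_w, in_h) = orig_size, input_size
    sx, sy = in_w / orig_w, in_h / orig_h
    x1, y1, x2, y2 = box
    return (round(x1 * sx), round(y1 * sy), round(x2 * sx), round(y2 * sy))

def box_to_text(box):
    """Serialize a box as a text target (hypothetical tag format)."""
    return "<box>({}, {}), ({}, {})</box>".format(*box)

# Pattern matching exactly the format produced by box_to_text.
_BOX_RE = re.compile(r"<box>\((\d+), (\d+)\), \((\d+), (\d+)\)</box>")

def text_to_box(text):
    """Parse the first box found in generated text, or return None."""
    match = _BOX_RE.search(text)
    return tuple(int(g) for g in match.groups()) if match else None

# A 1000x500 image resized to 500x250 before being fed to the model.
box = scale_box((100, 50, 300, 250), orig_size=(1000, 500), input_size=(500, 250))
target = box_to_text(box)
```

The string in `target` would serve as the answer text during fine-tuning, and `text_to_box` would recover the coordinates from the model's output at inference time.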

Two important notes: (1) Replace the original 1D RoPE with 2D RoPE so that the positional encoding captures spatial relationships along both image axes. (2) Use dynamic resolution: feed the actual input image dimensions when representing bounding boxes, points, and other spatial features. This lets the model learn scale information directly, improving its ability to handle images at different resolutions.
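
As an illustration of note (1), the following is a minimal pure-Python sketch of one common way to build 2D RoPE: split the channels in half, rotate the first half by the patch's row index and the second half by its column index. This is a generic formulation, not necessarily the variant the authors have in mind; the function names and the channel split are assumptions.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Standard 1D RoPE: rotate each consecutive channel pair (i, i+1)
    by the angle pos * base**(-i / d), where d is the vector length."""
    d = len(vec)
    assert d % 2 == 0, "channel count must be even"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

def rope_2d(vec, row, col, base=10000.0):
    """2D RoPE sketch (assumed layout): the first half of the channels
    encodes the row index, the second half encodes the column index."""
    d = len(vec)
    assert d % 4 == 0, "channel count must be divisible by 4"
    half = d // 2
    return rope_1d(vec[:half], row, base) + rope_1d(vec[half:], col, base)

def dot(u, v):
    """Plain dot product, used to compare attention scores."""
    return sum(a * b for a, b in zip(u, v))

q = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 0.1, 0.9]
k = [1.0, 0.3, -0.6, 0.8, -1.2, 0.4, 0.7, -0.2]

# Both pairs of positions differ by (2 rows, 3 columns).
score_a = dot(rope_2d(q, 3, 5), rope_2d(k, 1, 2))
score_b = dot(rope_2d(q, 10, 7), rope_2d(k, 8, 4))
```

With this construction the attention score between a query and a key depends only on their row and column offsets, so `score_a` and `score_b` agree up to floating-point error. That relative, per-axis behaviour is what lets the encoding reflect spatial layout rather than position in a flattened token sequence.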

May 28 '25 02:05 Paranioar