OmniParser
How to guide the caption model with a11y tree information?
Hi! First of all, thanks for your great work on OmniParser V2!
After reviewing the code in demo.ipynb, I understand that the workflow of OmniParser V2 involves:
- Using an OCR model and a YOLO model to detect text and icons.
- Applying a caption model (e.g., Florence) to generate captions for the detected icons (a minimal standalone sketch of this step is included right below).
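For my own understanding of the caption step, I pieced together a minimal standalone sketch from the Florence-2 model card. This is not the repo's actual code: the `microsoft/Florence-2-base` checkpoint and the crop file name are just placeholders on my side (OmniParser loads its own fine-tuned icon-caption weights), and I have not tested it against the repo.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Placeholder checkpoint: OmniParser actually loads its fine-tuned Florence weights.
model_id = "microsoft/Florence-2-base"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype, trust_remote_code=True).to(device)

# One detected icon crop, captioned with the fixed "<CAPTION>" task prompt.
icon = Image.open("icon_crop.png").convert("RGB")  # placeholder file name
inputs = processor(images=icon, text="<CAPTION>", return_tensors="pt").to(device=device, dtype=dtype)

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=64,
    do_sample=False,
)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```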
However, in my experience, the generated captions are sometimes too generic or even unrelated to the actual icon. In my specific scenario, I have a screenshot from Ubuntu along with its corresponding accessibility (a11y) tree. The captions in the a11y tree are generally more accurate, albeit sometimes too short.
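To make the idea concrete, the alignment I have in mind is roughly the following (untested sketch; the a11y node structure, the xyxy pixel box format, and the overlap threshold are all assumptions on my side):

```python
def iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2) in pixels; plain intersection-over-union.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def match_a11y_hints(icon_boxes, a11y_nodes, threshold=0.5):
    """For each detected icon box, return the name of the best-overlapping
    a11y node, or None if nothing overlaps enough.

    Assumed a11y node shape: {"name": "Trash", "bbox": (x1, y1, x2, y2)}.
    """
    hints = []
    for box in icon_boxes:
        best = max(a11y_nodes, key=lambda node: iou(box, node["bbox"]), default=None)
        hints.append(best["name"] if best and iou(box, best["bbox"]) >= threshold else None)
    return hints
```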
To improve captioning quality, I would like to incorporate the a11y tree captions into the prompt used by the caption model. However, in the `get_parsed_content_icon` function, the prompts for the caption model are structured as follows:
```python
if model.device.type == 'cuda':
    inputs = processor(images=batch, text=[prompt]*len(batch), return_tensors="pt", do_resize=False).to(device=device, dtype=torch.float16)
```
From what I see, the prompt is currently fixed (e.g., `"<CAPTION>"`) and does not vary for different inputs. I am unfamiliar with the Florence model and its processor, so I am unsure how to dynamically integrate the a11y tree captions into the prompt.
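What I would like to try, roughly, is building one prompt per icon from the matched a11y label and passing a list of different prompts instead of repeating the fixed one. Below is an untested sketch of how I imagine the change inside `get_parsed_content_icon` could look; `a11y_hints` is a list I would have to pass in myself, and I do not know whether Florence actually conditions on free text appended after the `<CAPTION>` task token, which is exactly what I would like to confirm.

```python
# Assumed extra argument: a11y_hints, aligned with `batch` (one label or None per crop).
prompts = [
    f"{prompt} The accessibility label of this icon is '{hint}'." if hint else prompt
    for hint in a11y_hints
]

# Prompts now differ in length, so I assume padding=True is needed to batch them.
if model.device.type == 'cuda':
    inputs = processor(images=batch, text=prompts, return_tensors="pt",
                       do_resize=False, padding=True).to(device=device, dtype=torch.float16)
else:
    inputs = processor(images=batch, text=prompts, return_tensors="pt",
                       do_resize=False, padding=True).to(device=device)
```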
Could you provide guidance on how to modify the prompt so that it includes information from the a11y tree for each corresponding icon?
Thanks in advance for your help!