OmniParser
How to guide the caption model with a11y tree information?
Hi! First of all, thanks for your great work on OmniParser V2!
After reviewing the code in demo.ipynb, I understand that the workflow of OmniParser V2 involves:
- Using an OCR model and a YOLO model to detect text and icons.
- Applying a caption model (e.g., Florence) to generate captions for the detected icons (a minimal standalone sketch of this step is included right below).
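For my own understanding of the caption step, I pieced together a minimal standalone sketch from the Florence-2 model card. This is not the repo's actual code: the `microsoft/Florence-2-base` checkpoint and the crop file name are just placeholders on my side (OmniParser loads its own fine-tuned icon-caption weights), and I have not tested it against the repo.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Placeholder checkpoint: OmniParser actually loads its fine-tuned Florence weights.
model_id = "microsoft/Florence-2-base"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype, trust_remote_code=True).to(device)

# One detected icon crop, captioned with the fixed "<CAPTION>" task prompt.
icon = Image.open("icon_crop.png").convert("RGB")  # placeholder file name
inputs = processor(images=icon, text="<CAPTION>", return_tensors="pt").to(device=device, dtype=dtype)

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=64,
    do_sample=False,
)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```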
However, in my experience, the generated captions are sometimes too generic or even unrelated to the actual icon. In my specific scenario, I have a screenshot from Ubuntu along with its corresponding accessibility (a11y) tree. The captions in the a11y tree are generally more accurate, albeit sometimes too short.
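To make the idea concrete, the alignment I have in mind is roughly the following (untested sketch; the a11y node structure, the xyxy pixel box format, and the overlap threshold are all assumptions on my side):

```python
def iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2) in pixels; plain intersection-over-union.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def match_a11y_hints(icon_boxes, a11y_nodes, threshold=0.5):
    """For each detected icon box, return the name of the best-overlapping
    a11y node, or None if nothing overlaps enough.

    Assumed a11y node shape: {"name": "Trash", "bbox": (x1, y1, x2, y2)}.
    """
    hints = []
    for box in icon_boxes:
        best = max(a11y_nodes, key=lambda node: iou(box, node["bbox"]), default=None)
        hints.append(best["name"] if best and iou(box, best["bbox"]) >= threshold else None)
    return hints
```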
To improve captioning quality, I would like to incorporate the a11y tree captions into the prompt used by the caption model. However, in the `get_parsed_content_icon` function, the prompts for the caption model are structured as follows:
```python
if model.device.type == 'cuda':
    inputs = processor(images=batch, text=[prompt]*len(batch), return_tensors="pt", do_resize=False).to(device=device, dtype=torch.float16)
```
From what I see, the prompt is currently fixed (e.g., `"<CAPTION>"`) and does not vary for different inputs. I am unfamiliar with the Florence model and its processor, so I am unsure how to dynamically integrate the a11y tree captions into the prompt.
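What I would like to try, roughly, is building one prompt per icon from the matched a11y label and passing a list of different prompts instead of repeating the fixed one. Below is an untested sketch of how I imagine the change inside `get_parsed_content_icon` could look; `a11y_hints` is a list I would have to pass in myself, and I do not know whether Florence actually conditions on free text appended after the `<CAPTION>` task token, which is exactly what I would like to confirm.

```python
# Assumed extra argument: a11y_hints, aligned with `batch` (one label or None per crop).
prompts = [
    f"{prompt} The accessibility label of this icon is '{hint}'." if hint else prompt
    for hint in a11y_hints
]

# Prompts now differ in length, so I assume padding=True is needed to batch them.
if model.device.type == 'cuda':
    inputs = processor(images=batch, text=prompts, return_tensors="pt",
                       do_resize=False, padding=True).to(device=device, dtype=torch.float16)
else:
    inputs = processor(images=batch, text=prompts, return_tensors="pt",
                       do_resize=False, padding=True).to(device=device)
```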
Could you provide guidance on how to modify the prompt so that it includes information from the a11y tree for each corresponding icon?
Thanks in advance for your help!