UI Finetuning and evaluation script
@jwyang, thank you again for this wonderful work. Could you please tell us when you are planning to release the finetuning code for UI?
On the way! But actually it is pretty easy to set up given the current codebase and the datasets I have already uploaded to HF.
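Roughly, you load the released checkpoint and the HF datasets and plug them into the training entry in this repo. As a rough sketch of getting the pieces (the dataset repo id below is only a placeholder, check the HF org page for the exact names):

# Minimal sketch: load the released Magma checkpoint/processor and a UI dataset for finetuning.
# The dataset repo id is a placeholder; substitute the actual dataset names from HF.
from transformers import AutoModelForCausalLM, AutoProcessor
from datasets import load_dataset

model = AutoModelForCausalLM.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)

ui_data = load_dataset("MagmaAI/<ui-dataset-name>")  # placeholder dataset id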
Hi @jwyang, could you also let me know which script to run to reproduce the Mind2Web web UI navigation results reported in Table 4 of the paper?
Hi @jwyang,
I was following the comment by @srvmishra at https://github.com/microsoft/Magma/issues/77#issuecomment-2888804213. For point 2 raised by @srvmishra, I tried to generate the SoM on the image as directed, and I am facing the same issue. I used the following code to generate the SoM image; it is adapted from agents/ui_agent/app.py and from your directions here: https://github.com/microsoft/Magma/issues/64#issuecomment-2840340160. I am attaching the code I used to generate the SoM on the image.
# Generate a Set-of-Mark (SoM) overlay for a UI screenshot, adapted from
# agents/ui_agent/app.py using the OmniParser-v2.0 detection/captioning weights.
import io
import base64
import numpy as np
import cv2
from PIL import Image
from huggingface_hub import snapshot_download
from util.som import *
from util.utils import *

# Download the OmniParser-v2.0 weights (icon detector + icon captioner).
repo_id = "microsoft/OmniParser-v2.0"
local_dir = './weights'
snapshot_download(repo_id=repo_id, local_dir=local_dir)

image_path = "./assets/images/ui_agent_example.png"

# Load the YOLO icon detector and the Florence-2 icon caption model.
yolo_model = get_yolo_model(model_path='weights/icon_detect/model.pt')
caption_model_processor = get_caption_model_processor(model_name="florence2", model_name_or_path="./weights/icon_caption")

image_input = Image.open(image_path).convert("RGB")

# Scale the box/label drawing parameters with the image width.
box_overlay_ratio = image_input.size[0] / 3200
draw_bbox_config = {
    'text_scale': 0.8 * box_overlay_ratio,
    'text_thickness': max(int(2 * box_overlay_ratio), 1),
    'text_padding': max(int(3 * box_overlay_ratio), 1),
    'thickness': max(int(3 * box_overlay_ratio), 1),
}

# Run OCR to get text boxes, then combine them with the YOLO detections
# to produce the labeled (SoM) image.
ocr_bbox_rslt, is_goal_filtered = check_ocr_box(
    image_input, display_img=False, output_bb_format='xyxy', goal_filtering=None,
    easyocr_args={'paragraph': False, 'text_threshold': 0.9}, use_paddleocr=False)
text, ocr_bbox = ocr_bbox_rslt

dino_labled_img, label_coordinates, parsed_content_list = get_som_labeled_img(
    image_input, yolo_model, BOX_TRESHOLD=0.01, output_coord_in_ratio=False,
    ocr_bbox=ocr_bbox, draw_bbox_config=draw_bbox_config,
    caption_model_processor=caption_model_processor,
    ocr_text=text, iou_threshold=0.9, imgsz=None)

# The annotated image is returned as a base64 string; decode it and show it
# with OpenCV (convert RGB -> BGR for cv2.imshow).
image = Image.open(io.BytesIO(base64.b64decode(dino_labled_img)))
image_ = np.array(image)[:, :, ::-1]
cv2.imshow('image', image_)
cv2.waitKey(0)
cv2.destroyAllWindows()
Here is the original image:
Here is the output:
As can be seen, there are multiple boxes for the same UI element, for example labels 141, 45, 46, and 47. I used the default parameters for SoM generation.
Could you please let me know the parameters you used to generate the SoM? For example, would stricter thresholds along the lines of the snippet below be closer to your setup?
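This is the kind of adjustment I have in mind; the values are only my guesses based on the OmniParser demo defaults, not taken from your code or the paper:

# Same call as above, but with a higher detection threshold and a lower IoU
# threshold so that heavily overlapping boxes are suppressed.
# BOX_TRESHOLD=0.05 and iou_threshold=0.1 are my assumptions (OmniParser demo defaults),
# not necessarily the settings used for Magma.
dino_labled_img, label_coordinates, parsed_content_list = get_som_labeled_img(
    image_input, yolo_model, BOX_TRESHOLD=0.05, output_coord_in_ratio=False,
    ocr_bbox=ocr_bbox, draw_bbox_config=draw_bbox_config,
    caption_model_processor=caption_model_processor,
    ocr_text=text, iou_threshold=0.1, imgsz=None)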
Another query: how did you create the SoM images for Mind2Web, given that the dataset only provides HTML files? Did you render the HTML pages to screenshots first and then run the SoM pipeline on them, e.g. something like the sketch below?
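This is just my guess at the pipeline; the file paths are placeholders and the calls are standard Playwright, not anything from the Magma codebase:

# Rough sketch (my assumption, not the Magma pipeline): render a Mind2Web HTML dump
# to a screenshot with a headless browser, then feed the screenshot to the SoM code above.
from playwright.sync_api import sync_playwright

html = open("mind2web_page.html", "r", encoding="utf-8").read()  # placeholder path

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page(viewport={"width": 1280, "height": 720})
    page.set_content(html, wait_until="load")                   # render the raw HTML
    page.screenshot(path="mind2web_page.png", full_page=True)   # screenshot for the SoM step
    browser.close()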