UI Finetuning and evaluation script
@jwyang, thank you again for this wonderful work. Could you please tell us when you are planning to release the finetuning code for UI?
On the way! But actually it is pretty easy to set up given the current codebase and the datasets I have already uploaded to HF.
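Roughly, you load the released checkpoint and the HF datasets and plug them into the training entry in this repo. As a rough sketch of getting the pieces (the dataset repo id below is only a placeholder, check the HF org page for the exact names):

# Minimal sketch: load the released Magma checkpoint/processor and a UI dataset for finetuning.
# The dataset repo id is a placeholder; substitute the actual dataset names from HF.
from transformers import AutoModelForCausalLM, AutoProcessor
from datasets import load_dataset

model = AutoModelForCausalLM.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)

ui_data = load_dataset("MagmaAI/<ui-dataset-name>")  # placeholder dataset id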
Hi @jwyang, could you also let me know which script to run to reproduce the Mind2Web web UI navigation results reported in Table 4 of the paper?
Hi @jwyang,
I was following the comment by @srvmishra at https://github.com/microsoft/Magma/issues/77#issuecomment-2888804213. For point 2 raised by @srvmishra, I tried to generate the SoM on the image as directed, and I am facing the same issue. I used the following code to generate the SoM image; it is adapted from agents/ui_agent/app.py and from your directions here: https://github.com/microsoft/Magma/issues/64#issuecomment-2840340160. I am attaching the code I used to generate the SoM on the image.
# Generate a Set-of-Mark (SoM) overlay for a UI screenshot, adapted from
# agents/ui_agent/app.py using the OmniParser-v2.0 detection/captioning weights.
import io
import base64
import numpy as np
import cv2
from PIL import Image
from huggingface_hub import snapshot_download
from util.som import *
from util.utils import *

# Download the OmniParser-v2.0 weights (icon detector + icon captioner).
repo_id = "microsoft/OmniParser-v2.0"
local_dir = './weights'
snapshot_download(repo_id=repo_id, local_dir=local_dir)

image_path = "./assets/images/ui_agent_example.png"

# Load the YOLO icon detector and the Florence-2 icon caption model.
yolo_model = get_yolo_model(model_path='weights/icon_detect/model.pt')
caption_model_processor = get_caption_model_processor(model_name="florence2", model_name_or_path="./weights/icon_caption")

image_input = Image.open(image_path).convert("RGB")

# Scale the box/label drawing parameters with the image width.
box_overlay_ratio = image_input.size[0] / 3200
draw_bbox_config = {
    'text_scale': 0.8 * box_overlay_ratio,
    'text_thickness': max(int(2 * box_overlay_ratio), 1),
    'text_padding': max(int(3 * box_overlay_ratio), 1),
    'thickness': max(int(3 * box_overlay_ratio), 1),
}

# Run OCR to get text boxes, then combine them with the YOLO detections
# to produce the labeled (SoM) image.
ocr_bbox_rslt, is_goal_filtered = check_ocr_box(
    image_input, display_img=False, output_bb_format='xyxy', goal_filtering=None,
    easyocr_args={'paragraph': False, 'text_threshold': 0.9}, use_paddleocr=False)
text, ocr_bbox = ocr_bbox_rslt

dino_labled_img, label_coordinates, parsed_content_list = get_som_labeled_img(
    image_input, yolo_model, BOX_TRESHOLD=0.01, output_coord_in_ratio=False,
    ocr_bbox=ocr_bbox, draw_bbox_config=draw_bbox_config,
    caption_model_processor=caption_model_processor,
    ocr_text=text, iou_threshold=0.9, imgsz=None)

# The annotated image is returned as a base64 string; decode it and show it
# with OpenCV (convert RGB -> BGR for cv2.imshow).
image = Image.open(io.BytesIO(base64.b64decode(dino_labled_img)))
image_ = np.array(image)[:, :, ::-1]
cv2.imshow('image', image_)
cv2.waitKey(0)
cv2.destroyAllWindows()
Here is the original image:
Here is the output:
As can be seen, there are multiple boxes for the same UI element, for example labels 141, 45, 46, and 47. I used the default parameters for SoM generation.
Could you please let me know the parameters you used to generate the SoM? For example, would stricter thresholds along the lines of the snippet below be closer to your setup?
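This is the kind of adjustment I have in mind; the values are only my guesses based on the OmniParser demo defaults, not taken from your code or the paper:

# Same call as above, but with a higher detection threshold and a lower IoU
# threshold so that heavily overlapping boxes are suppressed.
# BOX_TRESHOLD=0.05 and iou_threshold=0.1 are my assumptions (OmniParser demo defaults),
# not necessarily the settings used for Magma.
dino_labled_img, label_coordinates, parsed_content_list = get_som_labeled_img(
    image_input, yolo_model, BOX_TRESHOLD=0.05, output_coord_in_ratio=False,
    ocr_bbox=ocr_bbox, draw_bbox_config=draw_bbox_config,
    caption_model_processor=caption_model_processor,
    ocr_text=text, iou_threshold=0.1, imgsz=None)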
Another query: how did you create the SoM images for Mind2Web, given that the dataset only provides HTML files? Did you render the HTML pages to screenshots first and then run the SoM pipeline on them, e.g. something like the sketch below?
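This is just my guess at the pipeline; the file paths are placeholders and the calls are standard Playwright, not anything from the Magma codebase:

# Rough sketch (my assumption, not the Magma pipeline): render a Mind2Web HTML dump
# to a screenshot with a headless browser, then feed the screenshot to the SoM code above.
from playwright.sync_api import sync_playwright

html = open("mind2web_page.html", "r", encoding="utf-8").read()  # placeholder path

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page(viewport={"width": 1280, "height": 720})
    page.set_content(html, wait_until="load")                   # render the raw HTML
    page.screenshot(path="mind2web_page.png", full_page=True)   # screenshot for the SoM step
    browser.close()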