Understand the response from magma server for UI navigation
Hi, thanks for the excellent work!
I deployed the FastAPI server on my machine via Docker.
I used https://github.com/microsoft/Magma/blob/main/server/test_api.py to test model inference, simply replacing the image and the prompt with the first example from the Mind2Web dataset you released.
However, I got the following response, which doesn't align with the assistant ground truth:

```json
{ "from": "assistant", "value": "{\"ACTION\": \"TYPE\", \"MARK\": 11, \"VALUE\": \"Jeerimiah Waton\"}" }
```
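For reference, the ground truth's `value` field is itself an escaped JSON string, so a response that matched it would decode into a single structured action like this (a minimal sketch):

```python
import json

# Ground-truth record from the Mind2Web example above; the "value"
# field holds a JSON-encoded action string.
ground_truth = {
    "from": "assistant",
    "value": "{\"ACTION\": \"TYPE\", \"MARK\": 11, \"VALUE\": \"Jeerimiah Waton\"}",
}

# Decode the inner JSON string into a dict with ACTION / MARK / VALUE keys.
action = json.loads(ground_truth["value"])
print(action["ACTION"], action["MARK"])  # TYPE 11
```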
```
Response text: Coordinate: (0.70, 0.10). Mark: 16.
Coordinate: (0.70, 0.40). Mark: 17.
Coordinate: (0.40, 0.10). Mark: 18.
Coordinate: (0.40, 0.10). Mark: 19.
Coordinate: (0.70, 0.54). Mark: 20.
Coordinate: (0.70, 0.18). Mark: 21.
Coordinate: (0.70, 0.43). Mark: 22.
Coordinate: (0.70, 0.49). Mark: 23.
Coordinate: (0.70, 0.16). Mark: 24.
Coordinate: (0.70, 0.20). Mark: 25.
Coordinate: (0.85, 0.65). Mark: 26.
Coordinate: (0.37, 1.02). Mark: 27.
Normalized actions: [0.996078431372549, 0.996078431372549, 0.996078431372549, 0.996078431372549, 0.996078431372549, 0.996078431372549, 0.996078431372549]
Delta values: [0.04980392156862745, 0.04980392156862745, 0.04980392156862745, 3.1276862745098044, 3.1276862745098044, 3.1276862745098044, 0.9980392156862745]
```
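If it helps while debugging, the `Coordinate: (x, y). Mark: n.` lines in the raw response can be pulled into structured records with a small regex; this is only a sketch against the output format shown above, not an official parser from the repo:

```python
import re

# Pattern matching one "Coordinate: (x, y). Mark: n." record, as seen
# in the server response pasted above.
LINE_RE = re.compile(r"Coordinate:\s*\(([\d.]+),\s*([\d.]+)\)\.\s*Mark:\s*(\d+)\.")

def parse_marks(text: str) -> list[tuple[float, float, int]]:
    """Return (x, y, mark) tuples for every coordinate line in the text."""
    return [(float(x), float(y), int(m)) for x, y, m in LINE_RE.findall(text)]

sample = (
    "Response text: Coordinate: (0.70, 0.10). Mark: 16.\n"
    "Coordinate: (0.37, 1.02). Mark: 27.\n"
)
print(parse_marks(sample))  # [(0.7, 0.1, 16), (0.37, 1.02, 27)]
```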
Did I make any mistakes?
I think so. The FastAPI server was never tested with UI screenshots, so there is no guarantee it will work for this use case. It would be better to start with the model inference example:
https://github.com/microsoft/Magma?tab=readme-ov-file#inference-with-huggingface-transformers
or the UI demo:
https://github.com/microsoft/Magma?tab=readme-ov-file#ui-agent