Gemini models as grounding agent: x,y point outputs are inverted and normalized

Open Tsukalos opened this issue 7 months ago • 3 comments

Using the framework with Gemini 2.5 Flash as the base and grounding agents, I found it difficult to get the agent to correctly set the location of various icons or even the Wikipedia search bar.

Looking deeper, the image understanding documentation for the Gemini models states the following:

The coordinates are returned relative to the image dimensions, scaled to [0, 1000]. You need to descale these coordinates based on your original image size.

Output: Bounding boxes in the [y_min, x_min, y_max, x_max] format. The top-left corner is the origin. The x and y axes run horizontally and vertically, respectively. Coordinate values are normalized to 0–1000 for every image.

This description pertains to an object detection task, but my own tests suggest that the model API returns points in the (y, x) format.

Using Google’s AI Studio, I provided a screenshot of my desktop and asked for the coordinates of a specific icon. The icon is located around [1128, 490] in [x, y] pixels, but the output of both 2.5 Flash and 2.5 Pro was [472, 592].

Interpreting this output as [y, x] and rescaling it from the [0, 1000] range to the 1920x1080 screenshot, the actual [x, y] coordinate becomes [1136, 509], which is close enough to the expected value.
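For anyone who wants to reproduce the arithmetic, here is a minimal sketch of that conversion. The function name is hypothetical (it is not part of Agent-S or the Gemini SDK), and it assumes the model answers in [y, x] order on a 0-1000 scale and that the screenshot is 1920x1080:

```python
def gemini_point_to_pixels(point, image_width, image_height):
    """Convert a Gemini-style [y, x] point on the 0-1000 scale to (x, y) pixels.

    Hypothetical helper for illustration only.
    """
    y_norm, x_norm = point  # assumption: the model returns (y, x), not (x, y)
    # Integer arithmetic, truncating any fractional pixel.
    x_px = x_norm * image_width // 1000
    y_px = y_norm * image_height // 1000
    return x_px, y_px


# Values from the AI Studio test above, on a 1920x1080 screenshot.
print(gemini_point_to_pixels([472, 592], 1920, 1080))  # (1136, 509)
```

The result lands within about 20 pixels of the expected [1128, 490].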

EDIT: in the agent, the RAW GROUNDING MODEL RESPONSE does not get inverted, but the normalized 0-1000 range might be worth documenting.

Tsukalos avatar May 16 '25 18:05 Tsukalos

I have the same issue here with every model.

Did any model work for you?

johnmalek312 avatar May 18 '25 06:05 johnmalek312

Any of the Gemini models worked for grounding with a 1920x1080 screenshot. To deal with the 0-1000 normalized output the API returns, I used the size parameters in the grounding engine definition, which de-normalize the output automatically:

  "grounding_width": 1000,
  "grounding_height": 1000,

Also, it would be nice to double-check whether any of the outputs are (x, y) rather than the (y, x) order the Google API docs state.

Tsukalos avatar May 19 '25 13:05 Tsukalos

Are you saying that the Gemini output needs to be rescaled in order to be correct?

SkeletonMask avatar May 20 '25 13:05 SkeletonMask

I can confirm the same: Gemini returns values in the 0-1000 range that have to be rescaled to the original screen size. Since the default --grounding_model_resize_width is 1366, the default setting won't work.
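To illustrate why the declared grounding width matters, here is a rough sketch. It assumes the framework maps model coordinates to the screen with a plain linear ratio (screen size divided by grounding size); that is my assumption about the behavior, not a copy of the Agent-S code, and `rescale_x` is a hypothetical helper:

```python
def rescale_x(x_model, grounding_width, screen_width):
    """Map an x value from the grounding model's coordinate space to screen pixels.

    Hypothetical helper; assumes simple linear scaling.
    """
    return round(x_model * screen_width / grounding_width)


x_model = 592        # x value Gemini returned in the test from the first post
screen_width = 1920  # actual screen width in pixels

# Declaring a width of 1000 matches Gemini's 0-1000 normalization.
print(rescale_x(x_model, 1000, screen_width))  # 1137

# The default width of 1366 treats the same value as a pixel in a
# 1366-wide image, so the click lands far to the left of the target.
print(rescale_x(x_model, 1366, screen_width))  # 832
```

With the width set to 1000 the result is near the icon's real position (around x = 1128); with 1366 it is off by roughly 300 pixels.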

I'm wondering what's the deal with other models; I'm unable to find one that actually works. Also, ui-tars got removed from OpenRouter, so I'm unable to test it.

DaWe35 avatar May 23 '25 18:05 DaWe35

Okay wtf Gemini just returned a response bigger than 1000, clicking outside my screen

Image
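One possible guard against out-of-range responses like this is to clamp the value before turning it into a click position. This is only a sketch under the same 0-1000 assumption; `to_safe_pixel` is a hypothetical name, not something Agent-S provides:

```python
def to_safe_pixel(norm_value, size_px, scale=1000):
    """Clamp a normalized coordinate to [0, scale] and convert it to a pixel index.

    Hypothetical helper; the result always stays inside [0, size_px - 1].
    """
    clamped = max(0, min(norm_value, scale))
    # A value equal to `scale` would map to size_px, one past the last pixel,
    # so cap the result at size_px - 1.
    return min(clamped * size_px // scale, size_px - 1)


print(to_safe_pixel(1042, 1920))  # 1919, pulled back to the right screen edge
print(to_safe_pixel(592, 1920))   # 1136, in-range values are unaffected
```

Clamping only keeps the click on screen. It does not make the prediction correct, so logging or retrying when the raw value exceeds 1000 may be the better choice.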

DaWe35 avatar May 23 '25 19:05 DaWe35