mlx-vlm Inaccurate Coordinate Outputs for MLX-Quantized UI-TARS-1.5 (4bit/6bit)

We've tested the quantized version of the UI-TARS-1.5 model (4-bit and 6-bit quantization) implemented with MLX. The work-in-progress implementation can be found here:
https://github.com/trycua/cua/tree/feature/agent/uitars-mlx

Problems Observed:

The model struggles to output accurate (x, y) click coordinates when running under MLX quantization (both 4-bit and 6-bit).
Specifically, the outputs often incorrectly target areas such as the center of the screen.
It's possible that the quantized models are particularly sensitive to slight implementation differences compared to the full-precision models (tested on cloud inference).
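
To put a number on the "center of the screen" symptom, a rough check like the one below could be run over the predicted clicks. This is only a sketch: the helper names (`is_near_center`, `center_bias_rate`), the 5% tolerance, and the assumption that clicks are already parsed into `(x, y)` pixel tuples in screen space are all hypothetical and not part of mlx-vlm or Cua.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]


def is_near_center(point: Point, screen: Tuple[int, int], tol: float = 0.05) -> bool:
    """Return True if `point` is within `tol` (a fraction of each screen
    dimension) of the screen center. The tolerance is an arbitrary choice."""
    x, y = point
    width, height = screen
    return abs(x - width / 2) <= tol * width and abs(y - height / 2) <= tol * height


def center_bias_rate(points: Iterable[Point], screen: Tuple[int, int], tol: float = 0.05) -> float:
    """Fraction of predicted clicks that land near the screen center."""
    points = list(points)
    if not points:
        return 0.0
    hits = sum(is_near_center(p, screen, tol) for p in points)
    return hits / len(points)
```

A noticeably higher rate for the quantized model than for the full-precision model on the same tasks would support the observation above, although it would not by itself identify the cause.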

Testing Setup:

Models tested locally on Mac with MLX (Apple Silicon).
Observed significant performance hit when running model + c/ua VM simultaneously.
Full precision model hosted on AWS endpoint produced correct behavior, suggesting the issue is specific to the MLX quantization.

Artifacts:

Trajectories comparison archive (full precision vs 6-bit quantized outputs) attached:
cua_uitars_trajectories.zip
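
For a step-by-step comparison of the two trajectories, something along these lines might help. It is a minimal sketch that assumes each trajectory has already been reduced to a list of `(x, y)` click coordinates with matching steps; the archive's actual layout is not described here, so loading and parsing are left out, and `click_errors` and `summarize` are hypothetical names.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def click_errors(reference: Sequence[Point], candidate: Sequence[Point]) -> List[float]:
    """Per-step Euclidean distance (in pixels) between reference clicks
    (e.g. full precision) and candidate clicks (e.g. 6-bit quantized)."""
    if len(reference) != len(candidate):
        raise ValueError("trajectories must have the same number of steps")
    return [math.dist(r, c) for r, c in zip(reference, candidate)]


def summarize(errors: Sequence[float]) -> Dict[str, float]:
    """Mean and maximum pixel error over a trajectory."""
    if not errors:
        return {"mean": 0.0, "max": 0.0}
    return {"mean": sum(errors) / len(errors), "max": max(errors)}
```

Since agents can diverge after the first wrong click, this kind of comparison is most meaningful on steps where both models saw the same screenshot.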

Environment Details:

Mac M3 Pro (local testing)
AWS EC2 for cloud model hosting (full precision comparison)
Cua framework for running agents (https://github.com/trycua/cua)

Apr 28 '25 23:04 francedot

@francedot this should be similar to this https://github.com/Blaizzy/mlx-vlm/pull/319

May 01 '25 13:05 prncvrm

@francedot can you take a pull from my fork and validate if it works as expected?

May 02 '25 08:05 prncvrm

@francedot can you take a pull from my fork and validate if it works as expected?

i pulled from your fork (commit b355cf1a85170957f626fa58c4c8dd205c92dd64 from today), but i am still noticing a bug with the coordinate outputs:

mlx-vlm ui-tars:

https://github.com/user-attachments/assets/a133b909-0385-46e9-854c-d351d2872813

torch ui-tars:

https://github.com/user-attachments/assets/4c7d6e5c-b167-4189-81bf-5b50492bdf27

misc environment details:

using this code for running the agent: https://github.com/trycua/cua/commit/6a6fe48dbca0bd8f17652c538e08183ba289eefe
and the following prompt: "please drag a line from the red circle to the green circle, then open a new tab and go to reddit\n\n(You are operating on macOS, use 'cmd' instead of 'ctrl' for most shortcuts e.g., hotkey(key='cmd c') for copy, hotkey(key='cmd v') for paste, hotkey(key='cmd t') for new tab).)"

May 05 '25 15:05 ddupont808

hey @prncvrm any fix for the above? keen on trying to get the mlx model to work

May 06 '25 14:05 GaleiqTesting

Yes, the new UI-TARS-1.5 uses Qwen2.5-VL, while the fix I raised was for Qwen2-VL. The two issues are similar; I'll try to fix this one and raise a PR by this weekend, hopefully.

May 06 '25 14:05 prncvrm

Awesome, feel free to ping me when it's ready!

I also left a comment on your current PR.

May 06 '25 22:05 Blaizzy

https://github.com/Blaizzy/mlx-vlm/pull/349 the PR's here, thanks to @ddupont808 🚀

May 14 '25 12:05 prncvrm
