
Inaccurate Coordinate Outputs for MLX-Quantized UI-TARS-1.5 (4bit/6bit)

Open francedot opened this issue 6 months ago • 7 comments

We've tested the quantized version of the UI-TARS-1.5 model (4-bit and 6-bit quantization) implemented with MLX. The work-in-progress implementation can be found here:
https://github.com/trycua/cua/tree/feature/agent/uitars-mlx

Problems Observed:

  • The model struggles to output accurate (x, y) click coordinates when running under MLX quantization (both 4-bit and 6-bit).
  • Specifically, the outputs often incorrectly target areas such as the center of the screen.
  • It’s possible that the quantized models are particularly sensitive to slight implementation differences compared to the full precision models (tested on cloud inference).
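
One way to put numbers on the problems above is to compare the coordinates from the two runs step by step. The following is a minimal sketch, not part of the cua or mlx-vlm code: the helper names (`parse_point`, `pixel_error`, `near_center`) are hypothetical, and it assumes each action string contains a single `(x,y)` pair in screen pixels, which may not match the exact UI-TARS output format.

```python
import math
import re

# Assumption: an action string contains a coordinate pair written as "(x,y)".
COORD_RE = re.compile(r"\((\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\)")


def parse_point(action):
    """Return the first (x, y) pair found in an action string, or None."""
    match = COORD_RE.search(action)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def pixel_error(reference, candidate):
    """Euclidean distance in pixels between two (x, y) points."""
    return math.dist(reference, candidate)


def near_center(point, width, height, tolerance=0.05):
    """True if the point is within `tolerance` (as a fraction of each
    screen dimension) of the screen center."""
    cx, cy = width / 2, height / 2
    return (abs(point[0] - cx) <= tolerance * width
            and abs(point[1] - cy) <= tolerance * height)


if __name__ == "__main__":
    # Made-up example strings, for illustration only.
    reference = parse_point("click(start_box='(212,87)')")
    quantized = parse_point("click(start_box='(512,384)')")
    print("error (px):", pixel_error(reference, quantized))
    print("quantized near center:", near_center(quantized, 1024, 768))
```

Running this over matching steps of the full precision and quantized trajectories would show whether the quantized outputs cluster near the center or are just noisy versions of the reference points.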

Testing Setup:

  • Models tested locally on Mac with MLX (Apple Silicon).
  • Observed a significant performance hit when running the model and the c/ua VM simultaneously.
  • Full precision model hosted on AWS endpoint produced correct behavior, suggesting the issue is specific to the MLX quantization.

Artifacts:

  • Trajectories comparison archive (full precision vs 6-bit quantized outputs) attached:
    cua_uitars_trajectories.zip

Environment Details:

  • Mac M3 Pro (local testing)
  • AWS EC2 for cloud model hosting (full precision comparison)
  • Cua framework for running agents (https://github.com/trycua/cua)


francedot avatar Apr 28 '25 23:04 francedot

@francedot this should be similar to this https://github.com/Blaizzy/mlx-vlm/pull/319

prncvrm avatar May 01 '25 13:05 prncvrm

@francedot can you take a pull from my fork and validate if it works as expected?

prncvrm avatar May 02 '25 08:05 prncvrm

> @francedot can you take a pull from my fork and validate if it works as expected?

i pulled from your fork (commit b355cf1a85170957f626fa58c4c8dd205c92dd64 from today), but i am still noticing a bug with the coordinate outputs:

mlx-vlm ui-tars:

https://github.com/user-attachments/assets/a133b909-0385-46e9-854c-d351d2872813

torch ui-tars:

https://github.com/user-attachments/assets/4c7d6e5c-b167-4189-81bf-5b50492bdf27

misc environment details:

  • using this code for running the agent: https://github.com/trycua/cua/commit/6a6fe48dbca0bd8f17652c538e08183ba289eefe
  • and the following prompt: "please drag a line from the red circle to the green circle, then open a new tab and go to reddit\n\n(You are operating on macOS, use 'cmd' instead of 'ctrl' for most shortcuts e.g., hotkey(key='cmd c') for copy, hotkey(key='cmd v') for paste, hotkey(key='cmd t') for new tab).)"

ddupont808 avatar May 05 '25 15:05 ddupont808

hey @prncvrm any fix for the above? keen on trying to get the mlx model to work

GaleiqTesting avatar May 06 '25 14:05 GaleiqTesting

yes, the new UI-TARS-1.5 uses Qwen2.5-VL, while the fix i raised was for Qwen2-VL. it's a similar issue; i'll try to fix it and raise a PR by this weekend, hopefully

prncvrm avatar May 06 '25 14:05 prncvrm

Awesome, feel free to ping me when it's ready!

I also left a comment on your current PR.

Blaizzy avatar May 06 '25 22:05 Blaizzy

https://github.com/Blaizzy/mlx-vlm/pull/349 the PR's here, thanks to @ddupont808 🚀

prncvrm avatar May 14 '25 12:05 prncvrm