mlx-vlm
Inaccurate Coordinate Outputs for MLX-Quantized UI-TARS-1.5 (4bit/6bit)
We've tested the quantized version of the UI-TARS-1.5 model (4-bit and 6-bit quantization) implemented with MLX. The work-in-progress implementation can be found here:
https://github.com/trycua/cua/tree/feature/agent/uitars-mlx
Problems Observed:
- The model struggles to output accurate (x, y) click coordinates when running under MLX quantization (both 4-bit and 6-bit).
- Specifically, the predicted clicks often land on the wrong area, such as the center of the screen.
- It’s possible that the quantized models are particularly sensitive to slight implementation differences compared to the full precision models (tested on cloud inference).
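One way to put a number on the "center of the screen" symptom is to measure how far each predicted click lands from the screen center. The following is only a minimal sketch: the function names, the `threshold` default, and the assumption that clicks are already available as `(x, y)` pixel tuples are hypothetical and are not part of mlx-vlm or Cua.

```python
import math


def center_distance(x, y, width, height):
    """Normalized distance from the screen center: 0.0 at the center, 1.0 at a corner."""
    cx, cy = width / 2, height / 2
    return math.hypot(x - cx, y - cy) / math.hypot(cx, cy)


def center_bias_fraction(clicks, width, height, threshold=0.1):
    """Fraction of clicks that fall within `threshold` (normalized) of the center.

    `clicks` is a list of (x, y) pixel tuples; `threshold` is an arbitrary
    illustrative cutoff, not a value taken from the issue.
    """
    if not clicks:
        return 0.0
    near = sum(
        1 for x, y in clicks
        if center_distance(x, y, width, height) <= threshold
    )
    return near / len(clicks)
```

A fraction close to 1.0 for the quantized model on tasks whose targets are spread across the screen would support the observation above; the same metric computed on the full precision outputs gives a baseline.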
Testing Setup:
- Models tested locally on Mac with MLX (Apple Silicon).
- Observed significant performance hit when running model + c/ua VM simultaneously.
- Full precision model hosted on AWS endpoint produced correct behavior, suggesting the issue is specific to the MLX quantization.
Artifacts:
- Trajectories comparison archive (full precision vs 6-bit quantized outputs) attached:
cua_uitars_trajectories.zip
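To compare the two sets of trajectories step by step, the per-step pixel error between the full precision and quantized click coordinates can be computed. This is a rough sketch under stated assumptions: the layout of the attached archive is not described here, so the code assumes the coordinates have already been extracted by hand into two equally long lists of `(x, y)` tuples, and the helper names are hypothetical.

```python
import math


def pixel_errors(reference, candidate):
    """Euclidean pixel distance between paired (x, y) predictions.

    `reference` holds the full precision coordinates and `candidate` the
    quantized ones, both assumed to be aligned step by step.
    """
    if len(reference) != len(candidate):
        raise ValueError("coordinate lists must have the same length")
    return [
        math.hypot(rx - cx, ry - cy)
        for (rx, ry), (cx, cy) in zip(reference, candidate)
    ]


def summarize(errors):
    """Mean and maximum of a list of pixel errors (zeros for an empty list)."""
    if not errors:
        return {"mean": 0.0, "max": 0.0}
    return {"mean": sum(errors) / len(errors), "max": max(errors)}
```

For example, `summarize(pixel_errors([(0, 0), (10, 10)], [(3, 4), (10, 10)]))` gives a mean error of 2.5 pixels and a maximum of 5.0 pixels.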
Environment Details:
- Mac M3 Pro (local testing)
- AWS EC2 for cloud model hosting (full precision comparison)
- Cua framework for running agents (https://github.com/trycua/cua)
@francedot this should be similar to this https://github.com/Blaizzy/mlx-vlm/pull/319
@francedot can you take a pull from my fork and validate if it works as expected?
i pulled from your fork (commit b355cf1a85170957f626fa58c4c8dd205c92dd64 from today), but i am still noticing a bug with the coordinate outputs:
mlx-vlm ui-tars:
https://github.com/user-attachments/assets/a133b909-0385-46e9-854c-d351d2872813
torch ui-tars:
https://github.com/user-attachments/assets/4c7d6e5c-b167-4189-81bf-5b50492bdf27
misc environment details:
- using this code for running the agent: https://github.com/trycua/cua/commit/6a6fe48dbca0bd8f17652c538e08183ba289eefe
- and the following prompt:
"please drag a line from the red circle to the green circle, then open a new tab and go to reddit\n\n(You are operating on macOS, use 'cmd' instead of 'ctrl' for most shortcuts e.g., hotkey(key='cmd c') for copy, hotkey(key='cmd v') for paste, hotkey(key='cmd t') for new tab).)"
hey @prncvrm any fix for the above? keen on trying to get the mlx model to work
yes, the new UI-TARS-1.5 uses Qwen2.5-VL, while the fix I raised was for Qwen2-VL. The two issues are similar, so I'll try to fix it and raise a PR by this weekend, hopefully.
Awesome, feel free to ping me when it's ready!
I also left a comment on your current PR.
https://github.com/Blaizzy/mlx-vlm/pull/349 the PR's here, thanks to @ddupont808 🚀