Awni Hannun
Ok I’ll take a look at that. But it’s not in MLX 0.19 so it can’t really be the same issue as above..
@chigkim were you building the main branch of MLX from source or did you install MLX from PyPI?
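One quick way to check which build is installed (just a generic sketch, not something specific to this issue):

```
# Show where pip thinks MLX came from and which version it is
pip show mlx

# Print the version string reported by the library itself
python -c "import mlx.core as mx; print(mx.__version__)"
```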
I've tried to reproduce this on several machines (M1 Max, M2 Ultra, M1 Ultra, and M3 Max), and so far I'm not seeing any issues in the output. Some questions / suggestions:...
I'm really stumped by this one to be honest. I tried your exact command, which works fine on several machines:

```
mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --max-kv-size 33000 --max-tokens 1000 --temp 0.0 ...
```
That is really curious.. There was a bug in one of our qmv kernels that was [recently fixed](https://github.com/ml-explore/mlx/pull/1577). It might be possible (but I think pretty unlikely) that this would...
Ok. Let's see if it's fixed after our next release (which includes a fix for quantization in some cases: https://github.com/ml-explore/mlx/pull/1577). If it's not fixed, I will try fuzzing around a...
No, you can use the same model (no requantization needed). You can test by pulling and building the main branch. It would be great to know if that works for...
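In case it's useful, building the main branch from source is roughly the following (assuming CMake and the Xcode command line tools are already set up):

```
# Grab the latest main branch and install it into the current environment
git clone https://github.com/ml-explore/mlx.git
cd mlx
pip install .   # builds the Metal kernels from source
```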
Interesting.. you can see all the commits between v0.18.1 and v0.19.0 here: https://github.com/ml-explore/mlx/compare/v0.18.1...v0.19.0. The commit in there that seems most likely to have changed something for LLM inference is the...
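If you have a local clone, the same range can be listed with plain git, e.g.:

```
# List the commits between the two release tags (run inside an mlx checkout)
git log --oneline v0.18.1..v0.19.0
```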
Could you try running with Metal validation enabled to see if that gives us any clues? (Low probability but when it hits it hits well): `METAL_DEVICE_WRAPPER_TYPE=1 METAL_DEBUG_ERROR_MODE=0 mlx_lm.generate ...`
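Putting that together with the command from above would look something like this (the prompt here is just a placeholder):

```
# Run the same generation with Metal API validation enabled
METAL_DEVICE_WRAPPER_TYPE=1 METAL_DEBUG_ERROR_MODE=0 \
  mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --max-kv-size 33000 --max-tokens 1000 --temp 0.0 --prompt "..."
```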
Also you can precompute the prompt cache to speed testing up:

```
mlx_lm.cache_prompt --prompt-cache-file prompt.safetensors --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --prompt -
```
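If I'm remembering the flag right, mlx_lm.generate can then pick that cache up with `--prompt-cache-file`, something like:

```
# Reuse the precomputed prompt cache (model should match the one used to build it)
mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --prompt-cache-file prompt.safetensors --max-tokens 1000 --temp 0.0 --prompt "..."
```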