Awni Hannun
Ok I’ll take a look at that. But it’s not in MLX 0.19 so it can’t really be the same issue as above..
@chigkim were you building the main branch of MLX from source or did you install MLX from PyPI?
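One quick way to check which build is installed (just a generic sketch, not something specific to this issue):

```
# Show where pip thinks MLX came from and which version it is
pip show mlx

# Print the version string reported by the library itself
python -c "import mlx.core as mx; print(mx.__version__)"
```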
I've tried to reproduce this on several machines (M1 Max, M2 Ultra, M1 Ultra, and M3 Max), and so far I'm not seeing any issues in the output. Some questions / suggestions:...
I'm really stumped by this one to be honest. I tried your exact command, which works fine on several machines:

```
mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --max-kv-size 33000 --max-tokens 1000 --temp 0.0 ...
```
That is really curious.. There was a bug in one of our qmv kernels that was [recently fixed](https://github.com/ml-explore/mlx/pull/1577). It might be possible (but I think pretty unlikely) that this would...
Ok. Let's see if it's fixed after our next release (which includes a fix for quantization in some cases: https://github.com/ml-explore/mlx/pull/1577). If it's not fixed, I will try fuzzing around a...
No, you can use the same model (no requantization needed). You can test by pulling and building the main branch. It would be great to know if that works for...
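In case it's useful, building the main branch from source is roughly the following (assuming CMake and the Xcode command line tools are already set up):

```
# Grab the latest main branch and install it into the current environment
git clone https://github.com/ml-explore/mlx.git
cd mlx
pip install .   # builds the Metal kernels from source
```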
Interesting.. you can see all the commits between v0.18.1 and v0.19.0 here: https://github.com/ml-explore/mlx/compare/v0.18.1...v0.19.0. The commit in there that seems most likely to have changed something for LLM inference is the...
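If you have a local clone, the same range can be listed with plain git, e.g.:

```
# List the commits between the two release tags (run inside an mlx checkout)
git log --oneline v0.18.1..v0.19.0
```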
Could you try running with Metal validation enabled to see if that gives us any clues? (Low probability but when it hits it hits well): `METAL_DEVICE_WRAPPER_TYPE=1 METAL_DEBUG_ERROR_MODE=0 mlx_lm.generate ...`
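Putting that together with the command from above would look something like this (the prompt here is just a placeholder):

```
# Run the same generation with Metal API validation enabled
METAL_DEVICE_WRAPPER_TYPE=1 METAL_DEBUG_ERROR_MODE=0 \
  mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --max-kv-size 33000 --max-tokens 1000 --temp 0.0 --prompt "..."
```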
Also you can precompute the prompt cache to speed testing up:

```
mlx_lm.cache_prompt --prompt-cache-file prompt.safetensors --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --prompt -
```
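If I'm remembering the flag right, mlx_lm.generate can then pick that cache up with `--prompt-cache-file`, something like:

```
# Reuse the precomputed prompt cache (model should match the one used to build it)
mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --prompt-cache-file prompt.safetensors --max-tokens 1000 --temp 0.0 --prompt "..."
```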