mlx-examples
mistral example seems to hang on Loading model from disk.
I am trying to run the Mistral 7B model following the README's instructions (and adding the missing numpy dep).
This is all I see for a very long time
> python mistral.py --prompt "It is a truth universally acknowledged," --temp 0
[INFO] Loading model from disk.
Hardware: Apple M1 Max, OS: 14.1.2 (23B92), Python 3.11.5, conda 23.7.4
My Python process stays at very low CPU usage and I am not seeing any disk access consistent with loading 14GB of data.
Any way to better debug this or get to something working?
I printed out logs and found it hangs here, as probably expected. I'm unsure how to debug why the mlx library is hanging with no work getting done at this line:
weights = mx.load(str(model_path / "mlx_mistral_7b.npz"))
I am reading into the mlx library now to see if there are further steps I can take
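In the meantime, to rule out a corrupt archive before digging further into MLX itself, a quick sketch like this can check the converted .npz directly (an .npz is just a zip of .npy arrays, so zipfile's CRC test applies). The path is assumed to match the snippet above:

```python
# Sketch: verify the converted weights archive without going through mx.load.
# The path below is assumed to match the one used in mistral.py.
import zipfile
from pathlib import Path

model_path = Path("mistral-7B-v0.1")    # adjust to the converted model directory
npz_file = model_path / "mlx_mistral_7b.npz"

with zipfile.ZipFile(npz_file) as zf:
    print("members:", len(zf.namelist()))
    bad = zf.testzip()                  # first member with a bad CRC, or None
    print("first corrupt member:", bad)
```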
Strange... I just ran it with no problem on an M1 Max. Is it possible the weights are ill-formatted? Maybe the conversion didn't work?
If this is still an issue, please reopen.
Definitely still an issue. I just ran the script through the night and it stayed at the same hang point.
Not sure how I can re-open this issue, unless you mean making another post.
I'm going to try a debugger and step through it, or compile the mlx lib and see if I can find out what's wrong.
I followed the README's instructions exactly and repeated the weight conversion step multiple times now with no luck.
I will also check whether the Mistral download has a matching hash, in case the download itself was corrupted...
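For that hash check, a minimal sketch like this computes the archive's SHA-256 for comparison against the published checksum (the file name and the expected digest below are placeholders):

```python
# Sketch: SHA-256 of the downloaded archive, to compare against the published checksum.
# Both the file name and the expected digest are placeholders.
import hashlib

archive = "mistral-7B-v0.1.tar"           # placeholder: downloaded archive name
expected = "<published sha256 digest>"    # placeholder: value from the download page

h = hashlib.sha256()
with open(archive, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)

print(h.hexdigest())
print("match" if h.hexdigest() == expected else "mismatch")
```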
I'm experiencing kind of the same thing: after following the README it hangs with no output for more than 10 minutes, but for me there is an additional line:
python3 mistral.py --prompt "It is a truth universally acknowledged," --temp 0
[INFO] Loading model from disk.
[INFO] Starting generation...
It is a truth universally acknowledged,
After that there is no significant CPU/GPU usage and I have to kill the process.
Environment info: M1 Pro 16GB, Sonoma 14.1.1 (23B81), Python 3.9.5
Also strange that pip cannot find the mlx package for Python 3.10, 3.11, or 3.12, despite PyPI requiring Python 3.7+.
@Diniden we've updated MLX and fixed a bug with loading from NumPy so try pip install --upgrade mlx and see if that helps. I would also double check the weights downloaded properly.
@Blucknote it looks like you are stuck in a very different spot. It would be helpful if you could find more precisely where it is stuck by stepping through with PDB or adding some prints around this part of the code.
Also regarding installation, you should definitely be able to install MLX with pip with a higher version of Python. The install docs have more information on possible issues.
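For the "adding some prints" route, a coarse pattern like this is usually enough to see which step the time disappears into (the commented-out calls are placeholders, not the example's real function names):

```python
# Sketch: coarse timestamped prints to bracket each stage of the script.
# The commented-out calls are placeholders for whatever mistral.py actually does.
import time

t0 = time.perf_counter()

def stamp(msg):
    print(f"[{time.perf_counter() - t0:8.2f}s] {msg}", flush=True)

stamp("start")
# model, tokenizer = load_model(...)     # placeholder for the weight-loading step
stamp("model loaded")
# logits = model(prompt_tokens)          # placeholder for the prompt forward pass
stamp("prompt processed")
# ... one iteration of the generation loop ...
stamp("first token")
```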
New version worked (sort of)!
The new release properly gave me:
libc++abi: terminating due to uncaught exception of type pybind11::error_already_set: BadZipFile: Bad CRC-32 for file 'layers.24.feed_forward.w3.weight.npy'
Instead of silent failure and infinite hang, which was really nice.
I re-downloaded the mistral weights and performed the steps again, and THEN it worked flawlessly!
Thanks for looking at this.
Patience I must have: in the debugger I do get output; the line that takes the longest by far is y = sample(logits[:, -1, :])
Maybe it would be faster with quantized weights, but I'll leave that check to someone smarter.
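One caveat when reading debugger timings like this: MLX evaluates lazily, so the line that looks slowest is often just where the computation graph finally gets forced (likely the whole forward pass being realized at the sampling step), not an expensive sampling operation per se. A small self-contained sketch of the effect:

```python
# Sketch: MLX's lazy evaluation means the cost shows up where the graph is forced,
# not on the line that builds it.
import time
import mlx.core as mx

a = mx.random.normal((4096, 4096))
b = mx.random.normal((4096, 4096))

t = time.perf_counter()
c = a @ b                         # returns almost immediately: only records the op
build_ms = (time.perf_counter() - t) * 1000

t = time.perf_counter()
mx.eval(c)                        # the matmul actually executes here
eval_ms = (time.perf_counter() - t) * 1000

print(f"graph build: {build_ms:.2f} ms, evaluation: {eval_ms:.2f} ms")
```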
Hmm, I wonder why it's so slow. I assume you are using the Metal back-end?
python -c "import mlx.core as mx; print(mx.default_device())"
Should give Device(gpu, 0).
If it's using the GPU, maybe there is some swapping. I am testing on a 32GB machine and it processes the prompt very quickly, in under a second.
> Should give Device(gpu, 0).
Output as expected:
import mlx.core as mx;
PyDev console: starting.
print(mx.default_device())
Device(gpu, 0)
Swap is definitely being used.
However... can I trust htop on a Mac? :D The system monitor gives me 14GB RAM usage and 12GB swap, but htop says it's 1.3GB mem and ~12GB on swap.
Stats shows the same as the system monitor.
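One way to settle the disagreement is to ask the process itself; a small sketch using the third-party psutil package (pip install psutil), run inside the same Python process or pointed at the PID of the mistral.py run:

```python
# Sketch: query memory numbers directly, independent of htop / the system monitor.
# Requires the third-party psutil package (pip install psutil).
import psutil

p = psutil.Process()                       # or psutil.Process(<pid of the mistral.py run>)
mem = p.memory_info()
print(f"rss:  {mem.rss / 2**30:.2f} GiB")  # resident (physical) memory
print(f"vms:  {mem.vms / 2**30:.2f} GiB")  # virtual size
print(f"available: {psutil.virtual_memory().available / 2**30:.2f} GiB")
print(f"swap used: {psutil.swap_memory().used / 2**30:.2f} GiB")
```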
Experiencing the same issue on an M1 Pro 16GB, stuck after the first phrase.
Thanks for this library. Was looking for this for a long time.
I have an M2 Pro 16GB. Memory for the Python process peaks at 13.56GB and it gets stuck.
python -c "import mlx.core as mx; print(mx.default_device())" gives Device(gpu, 0)
What is the ideal memory size for the device: 32GB or more? Because the converted model weights are 14GB, we might need quantization here.
Also, how much memory is ideal for LoRA fine-tuning? Running it on a 16GB MacBook Pro seems like too much for the machine.
I have a 32GB machine and the Mistral example runs pretty quickly. It seems like somewhere between 16 and 32 is the cutoff right now to get good perf, but I'm not sure exactly where. We're investigating memory use to see if it can be improved.
Thank you guys for this great library.
But I have also encountered the same issue as [singhaki](https://github.com/ml-explore/mlx-examples/issues/25#issuecomment-1846550551). The Mistral example does run correctly, but it takes around 2 hours... My computer is an M2 Pro 16GB too. Here is the memory usage.
Can confirm an M2 Pro 32GB works without issues; it loads and outputs.
I observe the same pattern as the one reported by @sproblvem here. When running either Mistral or Llama with 16GB RAM, it runs pretty slowly (using either weights I converted myself or weights downloaded from Hugging Face). My setup has the configuration shown here.
When running the quantized model on either option, performance improved a lot.
My setup is: MacBook Air M1, 16GB RAM.
> quantized model
@rodchile can you share more details please? How did you do the quantization? Was it a GGUF from TheBloke or something else? And if it was a GGUF, is the conversion script able to convert it to the required format?
@Blucknote There are pre-converted quantized models in the MLX Hugging Face community: https://huggingface.co/mlx-community
Also, all of the conversion scripts in the LLM examples can produce quantized models.
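As an illustration only: with the separate mlx-lm package (pip install mlx-lm), running a pre-converted 4-bit Mistral looks roughly like the sketch below. The repo id is a placeholder; check the mlx-community page for the exact name.

```python
# Sketch only: load a pre-converted 4-bit model from the mlx-community Hugging Face org.
# Requires the separate mlx-lm package (pip install mlx-lm); the repo id below is a
# placeholder -- check https://huggingface.co/mlx-community for the exact name.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-v0.1-4bit")   # placeholder repo id
text = generate(
    model,
    tokenizer,
    prompt="It is a truth universally acknowledged,",
    max_tokens=100,
)
print(text)
```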
@awni thanks!