
mistral example seems to hang on Loading model from disk.

Open Diniden opened this issue 8 months ago • 19 comments

I am trying to run the Mistral 7B model following the README's instructions (and adding the missing numpy dependency).

This is all I see for a very long time

> python mistral.py --prompt "It is a truth universally acknowledged,"  --temp 0
[INFO] Loading model from disk.

Hardware: Apple M1 Max, OS 14.1.2 (23B92), Python 3.11.5, conda 23.7.4

My Python process stays at very low CPU usage and I am not seeing any disk access consistent with reading 14GB of data.

Any way to better debug this or get to something working?

Diniden avatar Dec 06 '23 22:12 Diniden

I printed out logs and found it hangs here, as probably expected. I'm unsure how to debug why the mlx library is hanging with no work getting done:

weights = mx.load(str(model_path / "mlx_mistral_7b.npz"))

I am reading into the mlx library now to see if there are further steps I can take
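
Since an .npz file is just a zip archive, one quick sanity check before handing it to mx.load is to validate the archive with Python's zipfile module. A minimal sketch, assuming the filename produced by the README's conversion step:

import zipfile

path = "mlx_mistral_7b.npz"  # output of the conversion step
with zipfile.ZipFile(path) as zf:
    bad = zf.testzip()  # returns the first member with a bad CRC, or None
    print("all members OK" if bad is None else f"corrupt member: {bad}")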

Diniden avatar Dec 06 '23 23:12 Diniden

Strange... I just ran it with no problem on an M1 Max. Is it possible the weights are ill-formatted? Maybe the conversion didn't work?

awni avatar Dec 07 '23 05:12 awni

If this is still an issue, please reopen.

awni avatar Dec 07 '23 15:12 awni

Definitely still an issue. I just ran the script through the night and it stayed at the same hang point.

Not sure how I can re-open this issue. Unless you mean make another post.

I'm going to try a debugger and step through it, or compile the mlx lib and see if I can find out what's wrong.

I followed the README's instructions exactly and repeated the weight conversion step multiple times now with no luck.

I will also see if I can confirm the Mistral download has a matching hash, in case the download was corrupted...
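
For the hash check, something along these lines would do (a sketch; the filename and the reference checksum to compare against are placeholders):

import hashlib

def sha256_of(path, chunk=1 << 20):
    # Stream the file in 1 MiB chunks so a 14GB archive doesn't need to fit in RAM.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

print(sha256_of("mistral-7B-v0.1.tar"))  # compare against the published checksum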

Diniden avatar Dec 07 '23 16:12 Diniden

I'm experiencing roughly the same thing: after following the README it hangs with no output for more than 10 minutes, but for me there is an additional line:

python3 mistral.py --prompt "It is a truth universally acknowledged,"  --temp 0
[INFO] Loading model from disk.
[INFO] Starting generation...
It is a truth universally acknowledged,

After that there is no significant CPU/GPU usage and I have to kill the process.

Environment info: M1 Pro 16GB, Sonoma 14.1.1 (23B81), Python 3.9.5

It's also strange that pip cannot find the mlx package for Python 3.10, 3.11, or 3.12, even though PyPI says it requires Python 3.7+.

Blucknote avatar Dec 07 '23 16:12 Blucknote

@Diniden we've updated MLX and fixed a bug with loading from NumPy so try pip install --upgrade mlx and see if that helps. I would also double check the weights downloaded properly.

@Blucknote it looks like you are stuck in a very different spot. It would be helpful if you could find more precisely where it is stuck by stepping through with PDB or adding some prints around this part of the code.
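
For example, a rough timing wrapper like the sketch below (the helper and the call it wraps are illustrative, not code from mistral.py) can show which step actually stalls:

import time

def timed(label, fn, *args, **kwargs):
    # Run fn, print how long it took, and pass its result through unchanged.
    t0 = time.perf_counter()
    out = fn(*args, **kwargs)
    print(f"[DEBUG] {label}: {time.perf_counter() - t0:.2f}s")
    return out

# e.g. weights = timed("mx.load", mx.load, str(model_path / "mlx_mistral_7b.npz"))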

Also regarding installation, you should definitely be able to install MLX with pip with a higher version of Python. The install docs have more information on possible issues.

awni avatar Dec 07 '23 16:12 awni

New version worked (sort of)!

The new release properly gave me:

libc++abi: terminating due to uncaught exception of type pybind11::error_already_set: BadZipFile: Bad CRC-32 for file 'layers.24.feed_forward.w3.weight.npy'

Instead of silent failure and infinite hang, which was really nice.

I re-downloaded the mistral weights and performed the steps again, and THEN it worked flawlessly!

Thanks for looking at this.

Diniden avatar Dec 07 '23 17:12 Diniden

Patience I must have: in the debugger I do get output. The line that takes the longest is y = sample(logits[:, -1, :]). Maybe it would be faster with quantized weights, but I'll leave that check to someone smarter.
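
One caveat when timing individual lines: MLX evaluates arrays lazily, so the cost tends to show up on whichever line first forces a result rather than on the op that is conceptually expensive. A tiny self-contained probe (stand-in tensor, not the real model) showing how to time with an explicit mx.eval:

import time
import mlx.core as mx

logits = mx.random.normal((1, 1, 32000))  # stand-in for the model's output
t0 = time.perf_counter()
y = mx.argmax(logits[:, -1, :], axis=-1)  # greedy pick over the vocab, analogous to temp=0 sampling
mx.eval(y)                                # force the lazy computation here
print(f"sample + eval: {time.perf_counter() - t0:.4f}s")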

Blucknote avatar Dec 07 '23 17:12 Blucknote

Hmm, I wonder why it's so slow. I assume you are using the Metal back-end?

python -c "import mlx.core as mx; print(mx.default_device())"

Should give Device(gpu, 0).

If it's using the GPU, maybe there is some swapping. I am testing on a 32GB machine and it processes the prompt very quickly (under a second).
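
To rule out a Metal-specific issue, a sketch of a comparison run on the CPU back-end (assuming mlx.core's set_default_device, as in the MLX docs):

import mlx.core as mx

mx.set_default_device(mx.cpu)  # route subsequent ops to the CPU back-end
print(mx.default_device())     # should now report Device(cpu, 0)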

awni avatar Dec 07 '23 17:12 awni

Should give Device(gpu, 0).

Output as expected:

import mlx.core as mx;
PyDev console: starting.
print(mx.default_device())
Device(gpu, 0)

Swap is definitely being used.

However... can I trust htop on Mac? :D The system monitor gives me 14GB RAM usage and 12GB swap, but htop says it's 1.3GB of memory and ~12GB of swap.

Stats shows the same as the system monitor.

Blucknote avatar Dec 07 '23 18:12 Blucknote

I'm experiencing the same issue as @Blucknote on an M1 Pro 16GB: it gets stuck after the first phrase.

eugenepyvovarov avatar Dec 07 '23 20:12 eugenepyvovarov

Thanks for this library. Was looking for this for a long time.

I have an M2 Pro 16GB. Memory for the Python process peaks at 13.56GB and then it gets stuck.

python -c "import mlx.core as mx; print(mx.default_device())" gives Device(gpu, 0)

What is the ideal device memory size required, 32GB or more? Because the converted model weights are 14GB, we might need quantization here.

Also, how much is ideal for LoRA fine-tuning? Running it on a 16GB Mac seems like a stretch.

akashicMarga avatar Dec 08 '23 05:12 akashicMarga

I have a 32GB machine and the Mistral example runs pretty quickly. It seems like somewhere between 16 and 32 is the cutoff right now to get good perf, but I'm not sure exactly where. We're investigating memory use to see if it can be improved.
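
Rough arithmetic makes the 16GB pain point plausible (an estimate, assuming ~7.2B parameters stored as float16):

params = 7.24e9                  # approximate Mistral-7B parameter count
fp16_gib = params * 2 / 2**30    # two bytes per parameter at float16
q4_gib = params * 0.5 / 2**30    # roughly half a byte per parameter at 4-bit
print(f"fp16 weights ~{fp16_gib:.1f} GiB, 4-bit weights ~{q4_gib:.1f} GiB")

That is roughly 13.5 GiB of weights alone at float16, which nearly fills a 16GB machine once the OS and KV cache are counted and is consistent with the heavy swapping reported above; 4-bit weights would be closer to 3.4 GiB.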

awni avatar Dec 08 '23 20:12 awni

Thank you guys for this great library. But I have also encountered the same issue as [singhaki](https://github.com/ml-explore/mlx-examples/issues/25#issuecomment-1846550551): the Mistral example does run correctly, but it takes about 2 hours... My computer is an M2 Pro 16GB too. Here is the memory usage (screenshot of memory usage).

sproblvem avatar Dec 11 '23 13:12 sproblvem

Can confirm an M2 Pro 32GB works without issues: it loads and generates output.

bigsnarfdude avatar Dec 17 '23 19:12 bigsnarfdude

I observe the same pattern as the one reported by @sproblvem here. When running either Mistral or Llama with 16GB of RAM, it runs pretty slowly (using either my own converted weights or weights downloaded from Hugging Face). My setup has the configuration listed below.

When running the quantized model on either option, performance improved a lot.

My setup is: MacBook Air M1, 16GB RAM.

rodchile avatar Dec 29 '23 22:12 rodchile

@rodchile can you share more details about the quantized model, please? How did you do the quantization? Was it a GGUF from TheBloke or something else? And if it was a GGUF, is the conversion script able to convert it to the required format?

Blucknote avatar Dec 30 '23 09:12 Blucknote

@Blucknote There are pre-converted quantized models in the MLX Hugging Face community: https://huggingface.co/mlx-community

Also, all of the conversion scripts in the LLM examples can produce quantized models.
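
For the pre-converted route, fetching one of those repos looks roughly like this (the repo id below is only an example; browse the mlx-community page for the actual names):

from huggingface_hub import snapshot_download

# Download a hypothetical pre-quantized repo into the local Hugging Face cache and print its path.
local_dir = snapshot_download(repo_id="mlx-community/Mistral-7B-v0.1-4bit")
print(local_dir)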

awni avatar Dec 30 '23 15:12 awni

@awni thanks!

Blucknote avatar Dec 30 '23 21:12 Blucknote