What is the Expected Inference Performance?
I am running Llama/Mistral inference examples on my M1 Pro with 16GB of memory and getting around 80 sec/token.
- Does the framework support FP16?
- GPU usage seems low, do I need to do something to use the Metal GPU?
`mx.default_device` reports `Device(gpu, 0)`
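For reference, this is roughly how the device can be checked and forced (a minimal sketch assuming the standard `mlx.core` API):

```python
import mlx.core as mx

# Shows which device MLX will use by default; prints Device(gpu, 0) here.
print(mx.default_device())

# The GPU is already the default on Apple silicon, but it can be set explicitly.
mx.set_default_device(mx.gpu)

# Run a tiny op to confirm the Metal backend executes without errors.
x = mx.ones((4, 4))
mx.eval(x @ x)
```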
FP16/BF16 are both supported dtypes here.
The ops are lazy and will only execute the compute as needed, but if `default_device` indicates the GPU, it should be using the Metal kernels.
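A minimal sketch of both points, assuming the standard `mlx.core` API (the array shapes here are just for illustration):

```python
import mlx.core as mx

# FP16 is an ordinary dtype; casting is just another op.
w = mx.random.normal((1024, 1024)).astype(mx.float16)
x = mx.random.normal((1, 1024)).astype(mx.float16)

# This only builds a lazy graph; nothing has been computed yet.
y = x @ w

# mx.eval forces the computation, dispatching Metal kernels
# when the default device is the GPU.
mx.eval(y)
print(y.dtype, mx.default_device())
```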
I am curious if someone managed to run this on a laptop outside of the Ultras.
> I am running Llama/Mistral inference examples on my M1 Pro with 16GB of memory and getting around 80 sec/token.
Are you using the 7B Llama and 7B Mistral models? Is it a typo? Do you mean 80 ms/token or 80 sec/token?
Yes, I am using the provided Mistral example. It's not a typo; it takes around 80 seconds to generate one token.
GPU usage seems low, right?
@tcapelle Can you please check your memory pressure when running the model? At 16GB of memory, you may be running out of wired memory since the example uses FP16 (weights total nearly 14.6GB) and inference takes a bit more than that.
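Back-of-the-envelope numbers (assuming roughly 7.3B parameters for the 7B model, so the totals are approximate):

```python
# Rough weight-memory estimate for a ~7B parameter model.
params = 7.3e9
print(f"fp16: {params * 2 / 1e9:.1f} GB")  # ~14.6 GB for the weights alone
print(f"fp32: {params * 4 / 1e9:.1f} GB")  # ~29.2 GB, well beyond a 16GB machine
```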
@tcapelle I tried the LLaMA example on my M1 Pro 32GB. It's indeed slow, and I think that's mostly due to the weights being FP32. I haven't checked the Mistral example yet, but this performance is expected if that one is also FP32. Transformer inference is typically memory-bound, and using FP32 is a bottleneck.
Did you do additional modifications to run the example in FP16 or did I miss something?
Yes, that's an oversight: the Mistral example does fp16, but the Llama one does fp32 by default since that's what the weights are saved in.
You can see an example of casting the weights in the Mistral file. We should add the same for Llama (and probably just save them as fp16 in the first place as it doesn't seem to make a difference).
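For reference, the cast is of roughly this form (a sketch with an illustrative path, not the exact example code):

```python
import mlx.core as mx

# Load the saved parameter arrays (the path here is illustrative).
weights = mx.load("weights.npz")

# Cast every parameter to fp16 before loading it into the model;
# this roughly halves the memory the weights occupy.
weights = {k: v.astype(mx.float16) for k, v in weights.items()}
```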
Running inference with MLX won't touch the ANE in any way, right?
@rovo79 Correct. As per #18, the ANE API is closed source and not publicly accessible. I believe the only way to touch the ANE today is via CoreML.
So, has someone managed to run a 7B inference using MLX on 16GB of RAM? Or do you need an Ultra to make any use of MLX?