Feature Request: Add support for Phi-3.5 MoE and Vision Instruct

Open YorkieDev opened this issue 1 year ago • 19 comments

Prerequisites

  • [X] I am running the latest code. Mention the version if possible as well.
  • [X] I carefully followed the README.md.
  • [X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [X] I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Microsoft has recently dropped two new models in the Phi family:

- 3.5 MoE: https://huggingface.co/microsoft/Phi-3.5-MoE-instruct
- 3.5 Vision: https://huggingface.co/microsoft/Phi-3.5-vision-instruct

It would be nice to see support added to llama.cpp for these two models.

Motivation

Supporting all model releases so the wider community can enjoy these great free models.

Possible Implementation

No response

YorkieDev avatar Aug 21 '24 14:08 YorkieDev

MoE looks promising. Any word on how complex it is to add support for?

curvedinf avatar Aug 22 '24 06:08 curvedinf

Is someone working on it? :pray:

JackCloudman avatar Aug 23 '24 19:08 JackCloudman

Vision support especially would be worth it, but I lack the knowledge to do something like this.

simsi-andy avatar Aug 24 '24 09:08 simsi-andy

Yes, the vision model is surprisingly good. In GGUF format under llama.cpp, it would open up undreamt-of possibilities.

(Screenshot attached: Bildschirmfoto_20240821_213502)

mounta11n avatar Aug 25 '24 04:08 mounta11n

ChatLLM.cpp supports Phi-3.5 MoE model now.

For developers: MoE Sparse MLP is ~the same as~ a little different from the one used in Mixtral.
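
For developers looking at a port, here is a minimal NumPy sketch of a generic Mixtral-style top-2 sparse MoE MLP forward pass; it is purely illustrative (names, shapes, and expert count are assumptions), and as noted above, Phi-3.5 MoE's gating differs slightly from this.

```python
# Minimal sketch of a Mixtral-style top-2 sparse MoE MLP forward pass.
# Purely illustrative: shapes and names are assumptions, and the exact
# gating used by Phi-3.5 MoE is a little different from this.
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def sparse_moe_mlp(x, w_gate, experts, top_k=2):
    """x: (hidden,); w_gate: (n_experts, hidden);
    each expert: dict with W1, W3 of shape (ff, hidden) and W2 of shape (hidden, ff)."""
    logits = w_gate @ x                        # one router score per expert
    top = np.argsort(logits)[-top_k:]          # indices of the top-k experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                   # softmax over the selected experts only
    y = np.zeros_like(x)
    for w, i in zip(weights, top):
        e = experts[i]
        h = silu(e["W1"] @ x) * (e["W3"] @ x)  # SwiGLU-style feed-forward
        y += w * (e["W2"] @ h)                 # weighted sum of expert outputs
    return y
```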

foldl avatar Aug 28 '24 08:08 foldl

https://github.com/foldl/chatllm.cpp

| Supported Models | Download Quantized Models |

What's New:

2024-08-28: Phi-3.5 Mini & MoE

Inference of a bunch of models from less than 1B to more than 300B, for real-time chatting with RAG on your computer (CPU), pure C++ implementation based on @ggerganov's ggml.

ayttop avatar Aug 28 '24 19:08 ayttop

https://huggingface.co/microsoft/Phi-3.5-MoE-instruct/discussions/4

microsoft/Phi-3.5-MoE-instruct: convert to GGUF

ayttop avatar Aug 28 '24 19:08 ayttop

Pretty sad to see no support for Phi 3.5 MoE in llama.cpp. Sure, it might have dry writing and is very censored, but in assistant tasks it's much better than all the smaller models combined. It truly offers 70B-class quality with just 6.6B active parameters, so it's much easier to run than even Gemma 2 27B (which it beats according to benchmarks).

Dampfinchen avatar Aug 29 '24 18:08 Dampfinchen

@Dampfinchen, have you found any way to run Phi 3.5 MoE locally? I'm open to trying out alternatives to llama.cpp.

sourceholder avatar Aug 29 '24 23:08 sourceholder

Also eager to get Phi 3.5-Vision support. Most accurate photo and screenshot descriptions I've seen so far.

arnesund avatar Sep 01 '24 18:09 arnesund

@Dampfinchen @sourceholder @arnesund if you are interested in running Phi 3.5 MoE or Phi 3.5 vision with alternatives to llama.cpp, perhaps you could check out mistral.rs.

Just a quick description:

We have support for Phi 3.5 MoE (docs & example: https://github.com/EricLBuehler/mistral.rs/blob/master/docs/PHI3.5MOE.md) and Phi 3.5 vision (docs & examples: https://github.com/EricLBuehler/mistral.rs/blob/master/docs/PHI3V.md).

All models can be run with CUDA, Metal, or CPU SIMD acceleration. We have Flash Attention and PagedAttention support for increased inference performance, and support in-situ quantization in GGUF and HQQ formats.

If you are using the OpenAI API, you can use the provided OpenAI-compatible HTTP server (a superset: we have things like min-p, DRY, etc.). There is also a Python package. For Phi 3.5 MoE and other text models, there is also an interactive chat mode.
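
As an illustration of the OpenAI-compatible route, here is a hypothetical snippet using the official openai Python client against a locally running server; the base URL, port, API key, and model id below are assumptions, so check the mistral.rs docs for the actual values.

```python
# Hypothetical client call to an OpenAI-compatible local server (e.g. mistral.rs).
# base_url, api_key, and the model id are placeholders; consult the server docs.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="phi-3.5-moe-instruct",  # placeholder model id
    messages=[{"role": "user", "content": "Summarize what a sparse MoE layer does."}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```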

EricLBuehler avatar Sep 02 '24 18:09 EricLBuehler

Thank you, but I and many others would rather wait for official support.

I wonder what the holdup is? Shouldn't it be possible to copy a lot of the code from Mixtral to Phi 3.5 MoE, given they have a pretty similar architecture with two active experts?

Dampfinchen avatar Sep 03 '24 10:09 Dampfinchen

> Thank you, but I and many others would rather wait for official support.
>
> I wonder what the holdup is? Shouldn't it be possible to copy a lot of the code from Mixtral to Phi 3.5 MoE, given they have a pretty similar architecture with two active experts?

No one has taken the task up yet, sadly. There is presently work being done on Phi-3.5 Vision Instruct, though, which is something to look forward to considering the reported vision understanding of the model.

Thellton avatar Sep 03 '24 10:09 Thellton

Any update on phi-3.5-moe-instruct GGUF support in llama.cpp?

ayttop avatar Sep 04 '24 17:09 ayttop

Bumping this thread. :)

bunnyfu avatar Sep 12 '24 11:09 bunnyfu

It's strange that no one is looking into this. Phi-3.5 MoE seems to be the best model currently able to run on a consumer-grade CPU.

vaibhav1618 avatar Sep 16 '24 01:09 vaibhav1618

@vaibhav1618, FYI - Deepseek V2 Lite (16B) is another good MoE model. 2.4B activated params.

sourceholder avatar Sep 16 '24 01:09 sourceholder

Phi-3.5 MoE seems to be based on https://huggingface.co/microsoft/GRIN-MoE/tree/main. Maybe their technical report at https://arxiv.org/abs/2409.12136 can help with identifying differences from other MoE architectures, which should make it easier to add support in llama.cpp.

ThiloteE avatar Sep 21 '24 11:09 ThiloteE

> There is presently work being done on Phi-3.5 Vision Instruct, though, which is something to look forward to considering the reported vision understanding of the model.

Where is the work on Phi-3.5 Vision Instruct being done? Thanks a lot!

yueshen-intel avatar Oct 03 '24 18:10 yueshen-intel

> @Dampfinchen @sourceholder @arnesund if you are interested in running Phi 3.5 MoE or Phi 3.5 vision with alternatives to llama.cpp, perhaps you could check out mistral.rs.
>
> Just a quick description:
>
> We have support for Phi 3.5 MoE (docs & example: https://github.com/EricLBuehler/mistral.rs/blob/master/docs/PHI3.5MOE.md) and Phi 3.5 vision (docs & examples: https://github.com/EricLBuehler/mistral.rs/blob/master/docs/PHI3V.md).
>
> All models can be run with CUDA, Metal, or CPU SIMD acceleration. We have Flash Attention and PagedAttention support for increased inference performance, and support in-situ quantization in GGUF and HQQ formats.
>
> If you are using the OpenAI API, you can use the provided OpenAI-compatible HTTP server (a superset: we have things like min-p, DRY, etc.). There is also a Python package. For Phi 3.5 MoE and other text models, there is also an interactive chat mode.

@EricLBuehler, can you recommend a frontend app to use with mistral.rs?

limingchina avatar Oct 20 '24 10:10 limingchina

The PR in the transformers repo to support Phi-3.5 MoE has been merged and is featured in release v4.46.0, so maybe llama.cpp can finally add this model architecture?

Oh, and by the way, I just found the documentation for how to add a new model to llama.cpp, after having followed this repo for months now, lol. Here are the docs: https://github.com/ggerganov/llama.cpp/blob/master/docs/development/HOWTO-add-model.md

ThiloteE avatar Oct 25 '24 07:10 ThiloteE

+1 for MoE support.

sinand99 avatar Nov 10 '24 16:11 sinand99

> Also eager to get Phi 3.5-Vision support. Most accurate photo and screenshot descriptions I've seen so far.

+1 for Phi 3.5-Vision support.

skylake5200 avatar Nov 29 '24 03:11 skylake5200

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Feb 12 '25 01:02 github-actions[bot]