
Add the new multi-modal models from Mistral AI: mistral-small-3.1-24b & pixtral-12b

SuperPat45 opened this issue 1 year ago · 10 comments

Add the new multi-modal models from Mistral AI, mistral-small-3.1-24b and pixtral-12b:

https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503
https://huggingface.co/mistral-community/pixtral-12b-240910
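
For context, once such a model is wired up, the natural way to use it would be through LocalAI's OpenAI-compatible chat endpoint with image content. A minimal sketch, assuming LocalAI is running on localhost:8080 and a model named "pixtral-12b" has been configured (both are assumptions, not something that exists today):

```python
# Sketch of querying a vision model through LocalAI's OpenAI-compatible API.
# The endpoint, API key and model name ("pixtral-12b") are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="pixtral-12b",  # hypothetical model name, not yet available
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```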

SuperPat45 · Sep 12 '24 11:09

Since yesterday, vLLM has InternVL2 support. :-)

https://github.com/vllm-project/vllm/releases/tag/v0.6.1

AlexM4H · Sep 13 '24 10:09

I guess that would already work with llama.cpp GGUF models if/when it gets supported there (see also https://github.com/ggerganov/llama.cpp/issues/9440).

I'd change the focus of this one to be more generic: add support for multimodal models with vLLM. Examples:

https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_pixtral.py
https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language_multi_image.py
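
For reference, the linked Pixtral example boils down to roughly the following; a minimal sketch assuming the mistralai/Pixtral-12B-2409 checkpoint, a recent vLLM (>= 0.6.1), and a reachable image URL:

```python
# Rough sketch of vLLM offline multimodal inference, loosely following the
# linked offline_inference_pixtral.py example. The checkpoint name, context
# length and image URL below are assumptions for illustration.
from vllm import LLM
from vllm.sampling_params import SamplingParams

llm = LLM(
    model="mistralai/Pixtral-12B-2409",  # assumed HF checkpoint
    tokenizer_mode="mistral",
    max_model_len=8192,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }
]

outputs = llm.chat(messages, sampling_params=SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```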

mudler · Sep 13 '24 16:09

vLLM already has Llama 3.2 support: https://github.com/vllm-project/vllm/pull/8811

Georgi wrote two weeks ago: "Not much has changed since the issue was created. We need contributions to improve the existing vision code and people to maintain it. There is interest to reintroduce full multimodal support, but there are other things with higher priority that are currently worked upon by the core maintainers of the project." (https://github.com/ggerganov/llama.cpp/issues/8010#issuecomment-2345831496)

AlexM4H · Sep 26 '24 06:09

See also: https://github.com/ggerganov/llama.cpp/issues/9455

mudler · Sep 26 '24 08:09

BTW: "(Coming very soon) 11B and 90B Vision models

11B and 90B models support image reasoning use cases, such as document-level understanding including charts and graphs and captioning of images."

(https://ollama.com/blog/llama3.2)

AlexM4H · Sep 26 '24 08:09

> BTW: "(Coming very soon) 11B and 90B Vision models
>
> 11B and 90B models support image reasoning use cases, such as document-level understanding including charts and graphs and captioning of images."
>
> (https://ollama.com/blog/llama3.2)

That would be interesting to see, given upstream (llama.cpp) is still working on it: https://github.com/ggerganov/llama.cpp/issues/9643

mudler · Sep 26 '24 08:09

It seems they are working on that independently: https://github.com/ollama/ollama/pull/6963

AlexM4H · Sep 26 '24 08:09

> It seems they are working on that independently: ollama/ollama#6963

That looks like only the Go-side changes to fit the images; the real backend changes seem to be in https://github.com/ollama/ollama/pull/6965.

mudler · Sep 26 '24 08:09

> > It seems they are working on that independently: ollama/ollama#6963
>
> That looks like only the Go-side changes to fit the images; the real backend changes seem to be in ollama/ollama#6965.

Oh, yes. Wrong link.

AlexM4H · Sep 26 '24 09:09

Mistral-small-3.1 with vision is now supported in Ollama via this PR: https://github.com/ollama/ollama/pull/10099
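
For anyone who wants to try it, a rough sketch of sending an image through Ollama's REST API once that is released; the model tag and file path are assumptions and may differ from what the release actually ships:

```python
# Rough sketch: image input via Ollama's /api/generate endpoint.
# The model tag "mistral-small3.1" and the local file path are assumptions.
import base64
import requests

with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral-small3.1",
        "prompt": "Summarize what this chart shows.",
        "images": [image_b64],
        "stream": False,
    },
)
print(resp.json()["response"])
```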

SuperPat45 · Apr 20 '25 10:04