support MiniCPM-V-2.5
Dear llama.cpp Official,
Hi, I'm writing regarding our new PR submission for integrating our model MiniCPM-Llama3-V 2.5 into llama.cpp. The model has been trending on Hugging Face for over a week and has garnered significant user demand. During the previous PR attempt for MiniCPM-V, we identified several critical implementation bugs. The official MiniCPM-V team has since fixed all of these issues, resulting in performance that matches our PyTorch version. These changes also distinguish our implementation significantly from the LLaVA example codebase.
Here are some key differences and improvements we've made:
- Flexible Image Handling: We support arbitrary image sizes by dynamically segmenting images into sub-images, allowing our ViT to accept various aspect ratios, unlike the fixed dimensions required by other models (a simplified sketch of this slicing idea follows this list).
- 2D Resampler: Our model uses a 2D resampler to downsample image features into much shorter sequences, significantly speeding up inference.
- Enhanced Embedding: Unlike the original ViT positional encoding used in previous VLMs, we employ a new approach for image embedding with a PosEmbedding layer.
- Distinct Tokenizer: Our tokenizer is different from LLaVA's, leading to unique special token decoding.
- Upper Framework Support: We've optimized our model for better integration with frameworks like Ollama.
- CLI Optimization: We've made modifications to better adapt the CLI for Android use.
- NPU-Optimized ViT: We've rewritten the Vision Transformer (ViT) component to leverage NPU on mobile devices, optimizing I/O for Android inference. (this week)
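To make the slicing idea concrete, here is a simplified Python sketch of how an image can be partitioned into a grid of sub-images whose cells roughly preserve the original aspect ratio. This is only an illustration, not our actual implementation: the 448-pixel cell size matches the ViT input resolution stored in the projector GGUF, while the slice cap and the scoring are placeholder simplifications.

import math

# Simplified illustration of adaptive image slicing (not the real MiniCPM-V
# code): choose a grid whose cells best preserve the image's aspect ratio,
# then crop the image into that many sub-images for the ViT.

def candidate_grids(num_slices):
    # All (cols, rows) factorizations of num_slices and its neighbours.
    grids = []
    for n in (num_slices - 1, num_slices, num_slices + 1):
        for rows in range(1, n + 1):
            if n % rows == 0:
                grids.append((n // rows, rows))
    return grids

def best_grid(width, height, cell=448, max_slices=9):
    # Roughly how many cell x cell tiles the image "wants" (the cap of 9
    # slices is an assumption made for this sketch).
    ideal = max(1, min(max_slices, round(width * height / cell ** 2)))
    target = math.log(width / height)
    # Pick the grid whose cell aspect ratio is closest to the image's.
    return min(candidate_grids(ideal),
               key=lambda g: abs(target - math.log(g[0] / g[1])))

def slice_boxes(width, height, grid):
    cols, rows = grid
    cw, ch = width // cols, height // rows  # remainder pixels ignored here
    return [(c * cw, r * ch, (c + 1) * cw, (r + 1) * ch)
            for r in range(rows) for c in range(cols)]

# A wide 1183x664 input ends up as a 3x1 grid with this toy scoring, which
# happens to agree with the "best_grid: 3 1" line in the CLI logs further
# down this thread.
print(best_grid(1183, 664))            # (3, 1)
print(slice_boxes(1183, 664, (3, 1)))  # three side-by-side crops

Each sub-image (together with a downscaled overview of the whole image) is then encoded separately and compressed by the 2D resampler into a short fixed-length sequence, which is what the "96 tokens" lines in the logs below correspond to.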
While some aspects of our implementation may appear similar to the LLaVA example codebase, these distinct features and optimizations set our model apart. We could reference LLaVA for the overlapping components to maintain code integrity, but this might compromise the standalone nature of the different examples, akin to how Hugging Face Transformers ensures each model has its own implementation.
Given the extensive user interest and the robust performance of our implementation, merging this model would significantly benefit the community. We are open to collaborating on any adjustments you deem necessary and are committed to ensuring the highest code quality and usability.
Thank you for considering our request. We look forward to your feedback and hope for a positive resolution.
Best regards, MiniCPM-V Official ^_^
waiting for this to be approved
Hi, first of all, thanks for taking the time to train the model as well as provide the llama.cpp implementation.
I played with this PR for a while and it seems there's a huge difference in quality between the llama.cpp implementation and what's available on your Hugging Face space. Any idea what causes that? For reference I was using your image of the plane, asking "What's this aircraft?", with the same sampling parameters as in the Hugging Face space. I tried with Q4_K_M, Q6_K, Q8_0 and f16 checkpoints and got similar results (which is expected), all of them much worse than your HF demo. Examples:
./minicpmv-cli -m ~/Downloads/ggml-model-Q8_0.gguf --mmproj ~/Downloads/mmproj-model-f16.gguf -c 4096 --temp 0 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image ~/test4.jpeg
What is this aircraft This is a large passenger jet.
Note: I tried with different values of --temp; 0 was set so that it was easier to reproduce.
vs
The aircraft in the image is an Airbus A380, which can be identified by its large size, double-deck structure, and the distinctive shape of its wings and engines. The A380 is a wide-body aircraft known for being the world's largest passenger airliner, designed for long-haul flights. It has four engines, which are characteristic of large commercial aircraft. The registration number on the aircraft can also provide specific information about the model if looked up in an aviation database.
from your README.md.
That's just one example. In general, asking about precise text information will yield hallucinations, as opposed to the model available in the demo. When I asked the model about the plane's serial number/id number it usually returned something along the lines of B-016..., as opposed to the B-6136 it was supposed to. Asking about the URL in a screenshot works well in the demo, but is broken in llama.cpp. Some of those issues could be explained by losing data during quantization, but it's still happening in the f16 variant.
~One possible explanation is that there was an issue with your converted models, since you used convert.py. They are using the wrong pre-tokenizer and possibly other things, since that script was made for llama/llama2-based models. I made #7627 that can be used to patch the models to use the right pre-tokenizer, but as I said, I'm not sure if that's all that's wrong.~ I tested with convert-hf-to-gguf.py and the results were pretty much the same.
python gguf-py/scripts/gguf-new-metadata.py --pre-tokenizer llama-bpe <input model path> <output path> can be used to patch your models, ~but you have to switch branch to #7627 since it's not merged yet.~
I haven't looked at the code in detail, I'll try to do it in the near future.
I'm not sure about keeping it separate from llava, since there's a lot of duplication. Ultimately I think that's @ggerganov's decision and he seems to be opposed (#6919 (comment)).
I definitely agree that a high-quality vision model would benefit the community greatly.
Well, we are also continuing to examine the model conversion problems and trying to verify the precision degradation caused by the use of llama.cpp.
On the question of where to put the model, I would like to ask the llama.cpp team whether it is necessary to merge it all into llava, and if so, whether it can still be called llava?
I saw ggerganov's reply, but he replied in the MiniCPM-V 2.0 PR. That PR is only a simple modification and does not change the dynamic image size, so it really is not much different from llava; if I had only read the MiniCPM-V 2.0 PR, I would also agree with ggerganov's view.
But MiniCPM-V 2.5 has more changes; in particular, the dynamic image size differs from llava across the whole processing flow. We think it may be a good way to merge MiniCPM-V 2.5 via this PR into llama.cpp, and then we will add MiniCPM-V 2.0 and other functionality, which makes it easier for us to add more features. Finally, because I'm not sure if ggerganov has seen this PR, I hope you will consider our suggestion.
Of course, if the llama.cpp team still needs us to merge into llava, we will find a way to integrate the code into llava. Although this is obviously a bit more complicated and will take more time to modify, we will respect the opinion of the llama.cpp team. ^_^
The main roadblock for the multimodality support is that we don't have a long-term vision of how to implement the API, and people to work on it. I can do that, but it will take me a while before I can focus on this - there is a lot of work still to be done for LLMs.
Duplicating the existing clip/llava codebase would not help for sure. It's better to find ways to reuse the code, improve the API and unify the multimodal implementations. But atm I can't give anything more specific than that as guidance.
I have added all the code into llava; I don't know if this improvement is enough. ^_^
@Galunid @mofosyne Hi, do you think this merge is possible?
@tc-mb
The python demo is impressive, even considering that the current PR is not the cleanest way of integrating it; if the results from python can be replicated, it is definitely worth adding in my opinion. I reviewed the PR and there are some minor and major issues left:
Major:
- The model using llama.cpp responds significantly worse than on the web demo, below llava 1.5 in my examples
- After getting that bad quality I tried to convert it myself. The first hurdle is that the convert.py mentioned in the readme is outdated (not available anymore in the main dir). I tried using the legacy convert.py, but the resulting GGUF file errors out with invalid magic characters "PK".
The fewer redundant files the better, so maintaining them does not become a big issue in the future.
Minor:
- The examples subdir doesn't exist anymore but is referenced in the cmakefile
- At the moment the new client is not being built by cmake; it has no section in the cmakefile
- The readme is referencing the old fork and outdated python tools.
- It should also be possible to remove minicpm-cli.cpp and just use llava-cli.cpp, built with a define flag (#if could trigger the wrapper headers to be included).
- Long-term the wrapper is not a nice solution, but for now imho it would be worth it if the generation quality issue is fixed.
- I don't think we should keep the separate encoder and surgery python files: the current "llava-surgery-v2" was intended to handle all sorts of models (pytorch, safetensors and different types of projectors), so the new one should just be added into it, similar to the old ones, instead of duplicating those tools.
For testing I've added a target into the cmakefile, removed the path from the examples cmakefile:
set(TARGET minicpmv-cli)
add_executable(minicpmv-cli minicpmv-cli.cpp)
install(TARGETS minicpmv-cli RUNTIME)
target_link_libraries(minicpmv-cli PRIVATE common minicpmv_wrapper llava ${CMAKE_THREAD_LIBS_INIT})
It compiles fine that way.
Generation quality issue
The generation quality however is the biggest issue; something seems to be quite wrong. I used the provided Q5K and 16-bit projector from your HF site. I tried to run it through a US driver's license which is OCR'd very well by your online python demo, but using minicpm-cli the results are worse than llava-1.5 (almost everything wrong, tons of hallucinations).
https://www.ocregister.com/wp-content/uploads/2018/10/0418_nws_idb-l-allen-col-0418-1.jpg?w=1035
.\build\bin\minicpmv-cli.exe -m Q:\models\llava\minicpm-2.5\ggml-model-Q5_K_S.gguf --mmproj Q:\models\llava\minicpm-2.5\mmproj-model-f16.gguf --image C:\temp\LICENSE_DEMO.jpg -ngl 99 -p "Describe all text visible of this demo image, do not censor anything and respond in json" --temp 0 -c 2048 -n 1000
<user>Describe all text visible of this demo image, do not censor anything and respond in json
<assistant>
The image is a composite of two different ID cards, likely for the same person, with various personal and biometric data. Here's the structured description of the text and elements visible in the image:
**Left ID Card:**
- **Header:** "California USA DRIVER LICENSE"
- **Photo:** A woman's photo
- **Name:** "Lina Rodriguez"
- **Date of Birth:** "08/31/1974"
- **Address:** "1234 Main St, San Diego, CA 92101"
- **License Number:** "DL 123456"
- **Expiration Date:** "08/31/2017"
- **Signature:** "Lina Rodriguez"
- **Biometric Data:** "SEX F" (Female), "HAIR BRN" (Brunette), "EYES BRN" (Brown), "FINGERNAIL POLISH" (No polish), "LIPSTICK" (No lipstick), "HAIR 09/30/2017", "EYES 09/30/2017", "FINGERNAIL POLISH 09/30/2017", "LIPSTICK 09/30/2017"
- **Additional Information:** "CLASS A", "08/31/2017", "RSTR NONE", "DONOR", "VOTER", "SEX F", "HAIR BRN", "EYES BRN", "FINGERNAIL POLISH", "LIPSTICK", "HAIR 09/30/2017", "EYES 09/30/2017", "FINGERNAIL POLISH 09/30/2017", "LIPSTICK 09/30/2017"
**Right ID Card:**
- **Header:** "USA DRIVER LICENSE"
- **Photo:** A woman's photo
- **Name:** "Lina Rodriguez"
...
Basically what @cmp-nct said. The generation quality is the biggest issue, especially when working with text. Have you tested if tokenizer works as you'd expect?
I think the problem is deeper than that; it also saw two images in my example instead of one. Most text was totally misread, as if tokens were not in the right sequence and/or the clip tensors were not correct. It looks like one or possibly multiple issues in CLIP and image sampling/ordering.
The two major problems (generation quality, conversion) need to be solved. Then I'd recommend a merge so as not to diverge further from master; getting rid of the minor redundancy can be done in later updates.
May I confirm which model you are using in the process? The code in this PR should not be able to directly use our GGUF models on HF; those GGUFs are matched to our own fork's code. But that requires using the new convert method, and I'm not sure whether it will work well, because I don't know the changes in llama.cpp's conversion script for llama3.
Hi, I was not able to convert a fresh one due to the error described, so I used the GGUF from your HF repository. I fear those are quite outdated and some issues might come from that (input tokenization might be off?). But it does not explain the severe problems; the CLIP model should not be different, right?
Can you run the license image on your local copy and test what results you get? Your web demo is able to provide a flawless OCR of all IDs and numbers.
If we can get the generation quality fixed and the conversion working, I'd want to get this PR merged as soon as possible. Every further week that passes, master keeps changing, so merging gets harder over time.
Hi! Sorry for the inconvenience. After looking into it, we also found some differences between the behavior of the proposed PR version and the actual Python model demo (also different from our fork). Once we noticed the problem, we re-ran the results from our fork, and they looked fine. We think the issue might be due to some changes in other parts of the master branch. As you can see from the results below, the previous version seems pretty accurate. We're eager to get the code merged into the official repository as soon as possible. We're doing our best to find out what's going wrong and would appreciate your help in identifying the issue.
At the end, we'll provide the complete steps to reproduce it.
# running script:
./minicpmv-cli -m ../openbmb/MiniCPM-V-2_5/gguf/ggml-model-Q4_K_M.gguf --mmproj ../openbmb/MiniCPM-V-2_5/new/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image /Users/a0/Pictures/20240614-205027.jpeg -p "Describe all text visible of this demo image, do not censor anything and respond in json"
# result from our fork:
<assistant>
The image displays a driver license from California, USA, with various personal and identification details.
At the top of the image:
"California USA DRIVER LICENSE"
In the center, within a card-like border:
"DL 11234568"
"EXP 08/31/2014"
"LN CARDHOLDER"
"FN IMA"
On the left side within a blue rectangle:
"CLASS C"
"END NONE"
On the right side within a red rectangle:
"RSTRNONE"
"SEX F"
"HG BRN"
"HAIR BRN"
"EYES BRN"
"ISS HGT 125 lb"
"OBT 08/31/2009"
At the bottom of the image within a black rectangle:
"Ma. Cardholder"
"00/0000000000000NNAN/AFD/YY"
"08/31/2009"
On the left side, near the woman's photo:
"I'm"
"Cardholder"
This looks very promising!
I've been looking into the conversion issue and made a bit of progress on that end:
- convert.py is not supported for llama3
- your surgery process creates a "model" directory with the new model which needs to be converted
The new method is convert-hf-to-gguf.py. This fails because it appears to need trust_remote_code=True (line 375) AND the checksum detection is not working out. I used this hack to make it detect it as llama3-bpe in line 426:
if chkhsh == "1baddeb572cd9de2a6d36f2ad0c361490bf5447dafca20afbac625e9d37f18a5":
# minicpm
res = "llama-bpe"
Maybe I am missing something, but I think that's a flaw of the current conversion script: the checksum detection is nice for 90% of all cases, but any changed model breaks that method.
With those two additions the conversion works like this:
python .\convert-hf-to-gguf.py Q:\models\llava\minicpm-2.5\model
.\build\bin\quantize Q:\models\llava\minicpm-2.5\model\ggml-model-f16.gguf q5_k 16
@Galunid please take a look at the conversion process; we should have a way to force model compatibility without manually adding a checksum. Or did I miss something?
Generation quality with the new model is significantly better, but still not as good as your example.
<user>Describe all text visible of this demo image, do not censor anything
<assistant>
The image is a composite of various identification documents and a photograph, likely used for illustrative purposes related to identity verification or security.
- On the top left corner: "California USA DRIVER LICENSE"
- License number: "DL 1234568"
- Address: "EXP 08/31/2014 2581 24TH ST 09/08/2017 2581 24TH ST 09/08/2017"
- Name: "Lina Cardoso"
- Date of Birth: "09/08/1977"
- Gender: "F"
- Hair Color: "BRN"
- Eye Color: "EXP 08/31/2014 2581 24TH ST 09/08/2017"
- Height: "5'8"""
- Weight: "125 lb"
- Blood Type: "O"
- Signature: "Lina Cardoso"
- Photo: A woman's portrait
- Bottom left corner: "CLASS A"
- Bottom right corner: "CLASS B"
- On the top right corner: "CLASS A"
- License number: "DL 1234568"
- Address: "EXP 08/31/2014 2581 24TH ST 09/08/2017 2581 24TH ST 09/08/2017"
- Name: "Lina Cardoso"
- Date of Birth: "09/08/1977"
- Gender: "F"
- Hair Color: "BRN"
- Eye Color: "EXP 08/31/2014 2581 24TH ST 09/08/2017"
- Height: "5'8"""
- Weight: "125 lb"
- Blood Type: "O"
- Signature: "Lina Cardoso"
- Photo: A woman's portrait
@cmp-nct In general you shouldn't add the checksum by hand; instead you should use convert-hf-to-gguf-update.py, which does that for you. You need to add the correct model there. There's some work done in #7379 to improve the process. You can check #6920 for details on why it was done this way. Unfortunately convert-hf-to-gguf-update.py has a problem with loading remote code (as in, it doesn't download that from the repo and it doesn't run it).
For now maybe it's best to use examples/convert-legacy-llama.py and then gguf-py/scripts/gguf-new-metadata.py --pre-tokenizer llama-bpe. I tried my hacked convert-hf-to-gguf.py a while ago and there wasn't a difference in generation quality.
When I tried the legacy converter I got a gguf binary with the wrong magic (PK); using the manual checksum "hack" worked. The update process didn't work out for me, though it's always a pain when I am gone for 2-3 weeks and so much is changing that it feels like years have passed - I probably did something wrong :) I think a "force" option would be a good way to handle special cases, better than having to modify the python code (checksum) for a one-time conversion?
Hi👋, any updates now? It looks like the results from the openbmb fork are fine, but the merge into this master branch is faulty? @tc-mb @cmp-nct
I'm quite sure there are discrepancies on the fork side too. My guess is that the finetuning is slightly broken; even a single wrong token can cause projector-based LLMs to become stupid. I am also not sure that our CLIP model doesn't have a fundamental computation issue; in my previous work on llava-1.6 I noticed significant differences compared to the reference but had no time to dig into it.
I hope @tc-mb can finish his PR here; minicpm in the python reference is quite stunning and would be a great benefit to llama.cpp (and higher-level projects like ollama).
Is there any progress?
@tc-mb
I tried your most recent commit on a reference image:
.\build\bin\RelWithDebInfo\minicpmv-cli.exe -m Q:\models\llava\minicpm-2.5\model\ggml-model-f16.gguf --mmproj Q:\models\llava\minicpm-2.5\mmproj-model-f16.gguf --image C:\temp\reference_2.png -ngl 99 -p "What is in the lower left corner?" --temp 0 -c 2048 -n 1000 --verbose-prompt
The response was "There is a calculator in the lower left corner." This is the same error that we have with Microsoft's Phi-V, which uses SigLIP, as if the spatial patches are mixed up. The number of image tokens was just 4x 96.
Below is the entire log:
Log start
clip_model_load: description: image encoder for MiniCPM-V
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 455
clip_model_load: n_kv: 18
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 18 key-value pairs and 455 tensors from Q:\models\llava\minicpm-2.5\mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_minicpmv_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.description str = image encoder for MiniCPM-V
clip_model_load: - kv 6: clip.projector_type str = resampler
clip_model_load: - kv 7: clip.vision.image_size u32 = 448
clip_model_load: - kv 8: clip.vision.patch_size u32 = 14
clip_model_load: - kv 9: clip.vision.embedding_length u32 = 1152
clip_model_load: - kv 10: clip.vision.feed_forward_length u32 = 4304
clip_model_load: - kv 11: clip.vision.projection_dim u32 = 0
clip_model_load: - kv 12: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 13: clip.vision.attention.layer_norm_epsilon f32 = 0.000001
clip_model_load: - kv 14: clip.vision.block_count u32 = 26
clip_model_load: - kv 15: clip.vision.image_mean arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 16: clip.vision.image_std arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 17: clip.use_gelu bool = true
clip_model_load: - type f32: 285 tensors
clip_model_load: - type f16: 170 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 0
clip_model_load: minicpmv_projector: 1
clip_model_load: model size: 1044.86 MB
clip_model_load: metadata size: 0.16 MB
clip_model_load: params backend buffer size = 1044.86 MB (455 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 104.80 MB
uhd_slice_image: multiple 4
uhd_slice_image: image_size: 1183 664; source_image size: 602 336
uhd_slice_image: image_size: 1183 664; best_grid: 3 1
uhd_slice_image: refine_image_size: 1050 588; refine_size: 1050 588
llava_image_embed_make_with_bytes_uhd: 602 336
llava_image_embed_make_with_bytes_uhd: 350 588
llava_image_embed_make_with_bytes_uhd: 350 588
llava_image_embed_make_with_bytes_uhd: 350 588
encode_image_with_clip: image embedding created: 96 tokens
encode_image_with_clip: image encoded in 108.98 ms by CLIP ( 1.14 ms per image patch)
encode_image_with_clip: image embedding created: 96 tokens
encode_image_with_clip: image encoded in 63.51 ms by CLIP ( 0.66 ms per image patch)
encode_image_with_clip: image embedding created: 96 tokens
encode_image_with_clip: image encoded in 62.05 ms by CLIP ( 0.65 ms per image patch)
encode_image_with_clip: image embedding created: 96 tokens
encode_image_with_clip: image encoded in 60.96 ms by CLIP ( 0.63 ms per image patch)
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from Q:\models\llava\minicpm-2.5\model\ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = model
llama_model_loader: - kv 2: llama.block_count u32 = 32
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 1
llama_model_loader: - kv 11: llama.vocab_size u32 = 128256
llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 20: tokenizer.ggml.unknown_token_id u32 = 128002
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 22: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type f16: 226 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7997 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 14.96 GiB (16.00 BPW)
llm_load_print_meta: general.name = model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128001 '<|end_of_text|>'
llm_load_print_meta: UNK token = 128002 '<unk>'
llm_load_print_meta: PAD token = 0 '!'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 1002.00 MiB
llm_load_tensors: CUDA0 buffer size = 14315.02 MiB
.........................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 258.50 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
minicpmv_init: llava init in 13.17 ms.
process_image: image token past: 0
process_image: image token past: 400
minicpmv_init: llama process image in 332.57 ms.
<user>What is in the lower left corner?
<assistant>
There is a calculator in the lower left corner.
llama_print_timings: load time = 5213.50 ms
llama_print_timings: sample time = 2.70 ms / 11 runs ( 0.25 ms per token, 4075.58 tokens per second)
llama_print_timings: prompt eval time = 399.80 ms / 413 tokens ( 0.97 ms per token, 1033.03 tokens per second)
llama_print_timings: eval time = 194.29 ms / 10 runs ( 19.43 ms per token, 51.47 tokens per second)
llama_print_timings: total time = 5414.02 ms / 423 tokens
I also tested the license demo; I used a slightly different prompt because it redacted everything before. The results on the license demo are better than before, though still nowhere near the superb quality of the python reference.
minicpmv_init: llama process image in 758.64 ms.
<user>Describe all text visible of this demo image, do not redact, answer in JSON format
<assistant>
'''json
{
"type": "driver-license",
"image": "https://i.imgur.com/7JY8T9L.jpg",
"data": {
"name": "California",
"issuing_state": "California",
"license_number": "1234568",
"expiration_date": "08/31/2014",
"address": "2500 N. 24TH ST, 2570 24TH ST, SAN FRANCISCO, CA 94103",
"date_of_birth": "09/08/1977",
"gender": "F",
"hair_color": "BRN",
"eye_color": "EXP",
"height": "5'8\"",
"weight": "125 lb",
"race": "WHITE",
"signature": "Igna Cardoso",
"issuing_date": "08/31/2009",
"issuing_authority": "California Department of Motor Vehicles"
}
}
'''
llama_print_timings: load time = 5633.64 ms
llama_print_timings: sample time = 51.57 ms / 214 runs ( 0.24 ms per token, 4149.86 tokens per second)
llama_print_timings: prompt eval time = 828.18 ms / 914 tokens ( 0.91 ms per token, 1103.63 tokens per second)
llama_print_timings: eval time = 4187.23 ms / 213 runs ( 19.66 ms per token, 50.87 tokens per second)
llama_print_timings: total time = 9919.41 ms / 1127 tokens
@cmp-nct
Hi, I'm sorry for the delay in the last two weeks.
Our team has limited manpower, and I was stuck on another urgent project until the end of last week. I will do my best to finish all the changes to the PR this week.
I converted the model using the advice mentioned earlier and re-reviewed the code, finding two issues in the previous PR. Our model supports images of any size and uses a different approach from llava. Even though I was as careful as possible when merging the code into llava, a key parameter was not passed down to the lowest level, so the model only used the default parameters. This caused the model to produce poor results, but with some correct outputs, which is why I didn't catch the bug immediately. The version I just submitted has fixed this issue.
Additionally, I'm not certain that this version is completely consistent with Python in terms of accuracy. This week, I will continue the work I hadn't completed before and quantitatively confirm the model's performance by running evaluation set metrics, rather than just checking a few cases.
I ran your update and there are significant improvements in output quality! It's still not 100% where it should be based on the python reference.
I ran it on a still image of store goods and it mentioned a mirror and text, both of which are not there.
The driving license is handled better than by any llava-1.5 model now, though still not as flawless as the reference.
On spatial questions it is now closer to the correct answer (yellow sticky note); somehow it does not see everything.
I tried to dig into that:
The green sticky note is in the lower left corner, and below it is a yellow sticky note.
Above the calculator, there is a pair of eyeglasses
Something is still strangely off, maybe in terms of image preprocessing or patching? It answers as if the image patches were tokenized in the wrong order.
I'm excited for your evaluation results.
@cmp-nct @tc-mb My apologies if this has already been discovered, but from my quick research and experimentation yesterday, I have been able to successfully use the openbmb/MiniCPM-Llama3-V-2_5-gguf VLM directly in LM Studio, upload an image and have it describe it, as well as nearly any other non-vision Llama 3 variant using xtuner's Llama 3 mmproj file found here: https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-gguf/tree/main. I believe this could offer clues to help speed up this merge and even potentially help llama.cpp expand its VLM capabilities, as VLMs are growing very quickly in relevance and usefulness!
Here is an example of it working, though most likely with downgraded quality from the llava influence:
This example is using the openbmb/MiniCPM-Llama3-V-2_5-gguf model with xtuner's Llama 3 mmproj file; I have also tried it with Lewdiculous/L3-8B-Stheno-v3.2-GGUF-IQ-Imatrix/L3-8B-Stheno-v3.2-IQ3_M-imat.gguf and a few other Llama 3 variants with overall good results.
I can only hope this helps somehow, as llama.cpp is without a doubt a leading LLM-based project, but we really need to figure out universal VLM compatibility soon, because in the upcoming months we will have MANY new VLM models with "full consistent video vision", not only for on-device use (for example consistent screen-share capabilities) but also consistent robotics vision, and I think llama.cpp would be a great foundation for all of that.
We believe we have now found the problem and have updated the C++ code. Maybe you can try this version.
By comparing it with the C++ code, we actually discovered a bug hidden in the python code. But unfortunately, the model has already been trained, so I can only keep the two implementations consistent by reproducing the same mistake, in order to use MiniCPM-V 2.5 well. We will improve this issue in future model training and provide the community with better performing open-source models.
I verified the accuracy of the model on MME and found that GGUF f16 loses a few tens of points, and the quantized versions lose a few tens more. Considering that MME's perfect score is 2800, this does not seem unacceptable. When I traced the numerical differences from beginning to end, I found that they exist from the very beginning: the interpolation function (bicubic_resize) in clip also causes obvious numerical differences. I will try to make changes next week.
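As a side note on how large such interpolation differences can be on their own, the snippet below compares two off-the-shelf bicubic resizes (Pillow vs. torch) on a random image of the same size as one of the slices in the logs above. It is only a generic illustration, assuming numpy, Pillow and torch are installed; it is not the clip.cpp bicubic_resize itself, but it shows that even two "bicubic" implementations already disagree at the pixel level before any quantization enters the picture.

import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image

# Random stand-in image, H x W x C, same size as the 602x336 source slice
# seen in the CLI logs.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(336, 602, 3), dtype=np.uint8)

# Bicubic resize via Pillow.
pil_out = np.asarray(
    Image.fromarray(img).resize((448, 448), Image.BICUBIC)
).astype(np.float32)

# Bicubic resize via torch (different kernel/edge handling than Pillow).
x = torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0).float()  # 1 x C x H x W
torch_out = (
    F.interpolate(x, size=(448, 448), mode="bicubic", align_corners=False)
    .clamp(0, 255)
    .squeeze(0)
    .permute(1, 2, 0)
    .numpy()
)

# The per-pixel differences are clearly non-zero and feed straight into the
# ViT, which can contribute to the score differences described above.
print("max abs diff: ", np.abs(pil_out - torch_out).max())
print("mean abs diff:", np.abs(pil_out - torch_out).mean())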
That sounds great, 10 points should not be a big hit in quality. Will I need to create a fresh gguf? I'm using the last one.
1) Model issues
I ran your changes and noticed some significant improvements, however the spatial test of reference_2.png is still strangely wrong.
Can you give reference_2.png a test run? Ask "what is below the green sticky note" or "what is in the lower left corner".
Your python reference works great, but this PR still exchanges the green and the yellow note in my test.
As if the spatial patches were not sorted, or some image sizes are causing an issue?
Maybe there is another issue here, something different on my PC than on yours. Please test that image with those two questions.
2) Merging issues
Please also look at the previous code comments; currently this PR cannot be compiled due to errors in the two cmakelist files. The C headers (like clip.h) need to be C compatible, otherwise other projects will not compile anymore. So the std::pair would need to be changed to a struct or into two separate inputs.
3) Final step
I think best would be if you can clean-pull this PR onto your PC and configure/make it once (like pull/7599/head:pull-request-test).
Your local llama.cpp branch seems to have differences (additional directories, different cmakefiles) to this PR; that's why you can compile it fine but here it fails.
- Another test that totally fails on llama.cpp is this wide-ratio image:
<user>describe the image?
<assistant>
The image is a screenshot of a math problem related to fractions. The problem is presented in a step-by-step format, starting with a fraction 3/10 and then showing the process of dividing it by 15 to simplify it to 1/5. The answer to the problem is given as 3/5. Below the problem, there is a question asking which answer is correct, suggesting that the image is likely used for educational purposes to test understanding of fractions. The background is white, and the text is in black, making it easy to read. The overall layout is simple and straightforward, focusing on the mathematical content.
This one shows the problem more closely. I don't think this is a problem with the bicubic resize (I had evaluated that function half a year ago and the results were good). I think you solved critical problems and now something related to image partitioning or patch ordering remains - possibly only for some resolution ratios.
<user>describe the formula in upper right, and the potential answers on bottom
<assistant>
The upper right of the image contains a mathematical formula that is part of a larger question. The formula is: 3 + 1/3 ÷ 1/9 = ? This is a complex fraction problem that requires the division of 1/9 by 1/3 and then adding 3 to the result. The potential answers to this question are provided at the bottom of the image, where the correct answer is highlighted in red text. The other options are (A) 31, (B) 49, (C) 109, (D) 15, and (E) 4. The correct answer is (D) 15.
I've summarized the issues for merging.
If possible, please also take a look at the two example images. There might be a problem with some resolutions; the calculation image in particular doesn't work at all.
My suggestion/hope is that we can now bring it into a mergeable status; then that last error (calculation.png shows it most clearly) could maybe still be solved to get this into excellent shape. I'm quite sure minicpm will dominate in usage for a long time on llama.cpp.
@cmp-nct Hi, I have addressed some of the merge issues mentioned above; could you check whether it is enough to continue with the merge?
Regarding the orientation issue you mentioned, we also found and confirmed that this version of the model is not very stable in that ability. We can only solve this problem in our next model release.
I will update our GGUF models on HF after this PR is merged, so that the open-source community can use the latest version of the model with the official branch.
Here are the results for the two example pictures above:
Your results look great (does the calculation image also work now?)
Can you please fix the cmake workflow?
[cmake] CMake Error at examples/CMakeLists.txt:33 (add_subdirectory):
[cmake] add_subdirectory given source "minicpmv" which is not an existing
[cmake] directory.
It will also need a set(TARGET llama-minicpmv-cli) in the llava cmakelists.txt
@tc-mb did you see my response? The cmake workflow must be fully functional. Once that works, we can ping GG to look for a merge, but that only makes sense if the compilation won't fail immediately.
@cmp-nct I'm sorry I didn't reply to you in time yesterday.
I'm not very familiar with cmake. If I make the wrong changes here, you can remind me at any time.
I borrowed from llama-llava-cli and added llama-minicpmv-cli. Can you check if this is feasible?
The cmake and make support is essential; GitHub automatically compiles the project and runs the test suites, so with any error in Make/CMake this will not reach merge status. The outdated directory is still being added in the CMakeLists.txt at line 33.
@ggerganov this is almost ready for merge; it should compile once that directory bug is resolved. The implementation is not flawless compared to the reference, but it's significantly better than all other llava/multimodal types we have on llama.cpp. Definitely better than our llava-1.6 implementation while also much more performant (far fewer projected tokens).
There are a couple of improvements and caveats in this implementation, but given how well it performs and how much work has gone into it already, we should get this merged; any medium/major improvements can come later.
Otherwise we risk master running away from us and losing the work.

