[llamacpp] - Quality of F16 GGUF is (still) worse than the online demo
Start Date
No response
Implementation PR
Reference Issues
Summary
@tc-mb
Today I am using MiniCPM-Llama3-V 2.5 with the recently released official support in llama.cpp: https://github.com/ggerganov/llama.cpp/blob/master/examples/llava/README-minicpmv2.5.md
I downloaded these files:
- https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf/blob/main/ggml-model-BF16.gguf
- https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf/blob/main/mmproj-model-f16_for_pr.gguf
Everything is running on an Apple MacBook Pro with an M3 Max and 128 GB RAM.
I am using the attached sample image to do OCR tasks.
The 'Starting balance' is reported as '5,902.10', which is wrong. The online demo (https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5) reports it correctly as '3,902.10'.
Basic Example
./llama-minicpmv-cli -m MiniCPM-Llama3-V-2_5/ggml-model-F16.gguf --mmproj MiniCPM-Llama3-V-2_5/mmproj-model-f16_for_pr.gguf -c 4096 --temp 0.0 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image ccs.jpg -i
Drawbacks
Unresolved questions
Hi, sorry for the late reply.
Most of the GGUFs in this repository are old versions now. Although MiniCPM-V 2.5 has been merged into official llama.cpp, we have not yet been able to update all of these GGUFs.
Can you try this GGUF instead? I got the correct result when I tested it locally: https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf/blob/main/ggml-model-Q4_K_M_for_pr.gguf
If there are still accuracy problems, please send me the detailed steps you used and I will continue to help.
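If it helps, one way to fetch that file from the command line (a sketch assuming the huggingface-cli tool from the huggingface_hub package is installed; the repository and file name are the ones linked above):

pip install -U "huggingface_hub[cli]"
huggingface-cli download openbmb/MiniCPM-Llama3-V-2_5-gguf ggml-model-Q4_K_M_for_pr.gguf --local-dir MiniCPM-Llama3-V-2_5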
Hey @tc-mb!
Thanks for the updated instructions.
I just tried the linked GGUF above with:
./llama-minicpmv-cli -m "MiniCPM-Llama3-V-2_5/ggml-model-Q4_K_M_for_pr.gguf" --mmproj "MiniCPM-Llama3-V-2_5/mmproj-model-f16_for_pr.gguf" -c 4096 --temp 0.0 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image ccs.jpg -i
It gave the same wrong result for the 'starting balance'.
Any more ideas what I could do? Thanks!
I just tested the v2.5 model with both a single question and -i mode, and got the correct value. Can I ask which branch you used for the test? I'm afraid you might still be using old branches or old .o files.
I am using the official llama.cpp release. Latest commit from today.
Just completely deleted the clone, re-cloned llama.cpp, rebuilt everything. Same result... 🤔
Are you also running on Apple/MPS/Metal @tc-mb ?
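For reference, a from-scratch rebuild like the one described above looks roughly like this (a sketch assuming the CMake build; on Apple Silicon, Metal should be enabled by default, and the example binaries end up under build/bin):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# configure and build everything, including the llava/minicpmv examples
cmake -B build
cmake --build build --config Release -j 8
./build/bin/llama-minicpmv-cli --help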
Platform: Windows. Version: llama-b3573-bin-win-cuda-cu12.2.0-x64
llama-minicpmv-cli -m C:/AI/llamaCPP/models/v2-5/ggml-model-F16.gguf --mmproj C:/AI/llamaCPP/models/v2-5/mmproj-model-f16.gguf -ngl 50 -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image C:/AI/llamaCPP/image/credit-card-statement.jpg -i
I'm getting good results with the model above, just , became .
Well, I still use my fork of llama.cpp. I will try official llama.cpp now. I use a MacBook Pro with an M2.
Ah, OK. I thought that everything 'good' was already merged into official llama.cpp 🙂.
Are you using the same models and command line as I do?
Nice!
So, something has to be wrong here...
Seems you are even using the old .gguf files - and it still works for you.
@ChristianWeyer Hi, I also verified the behavior on the official branch, and there seems to be no problem across multiple tests.
I had previously been testing with the branch submitted in the official PR, and was simply too lazy to clone the official code and build it again. After all, the code merged upstream has passed the accuracy check.
I am also a little confused about the problem you are facing.
Could you please post a complete log of the command you ran? Maybe that will help me figure out what the problem is.
Which complete log are you referring to?
./llama-minicpmv-cli -m MiniCPM-Llama3-V-2_5/ggml-model-F16.gguf --mmproj MiniCPM-Llama3-V-2_5/mmproj-model-f16_for_pr.gguf -c 4096 --temp 0.0 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image ccs.jpg -p "what is the starting balance"
You can execute this command once and send me all of the output from the command line. The log will be a little long, but it may help me find the source of the problem.
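One way to capture the complete output in a single file while still seeing it on screen (a sketch assuming a POSIX shell; run.log is just a placeholder file name, and the command is the one from above):

./llama-minicpmv-cli -m MiniCPM-Llama3-V-2_5/ggml-model-F16.gguf --mmproj MiniCPM-Llama3-V-2_5/mmproj-model-f16_for_pr.gguf -c 4096 --temp 0.0 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image ccs.jpg -p "what is the starting balance" 2>&1 | tee run.log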
Interesting:
- I executed it 6 times in "-p" mode - it always got it right.
- Then I tried "-i" mode several times - and it got it right.
- Then I tried "-i" and asked it just "starting balance" - it answered: 0000000000000000
However, I currently cannot reproduce the answer with 5,902.10... 🤷🏼♂️
I'm glad you got the right result, although I'm not sure what the problem was. Maybe asking with only a single word confuses the model.
Yeah. Something is really flaky.
Will you also update the other GGUF files soon? I want to use the F16 one. Thanks!
I've got the starting balance right, but I'm still getting performance differences compared to the demo when I try other questions.
I see more randomness via llama.cpp than the demo, even with a low temperature of 0.1.
Sometimes it gets things right and sometimes not on the same question.
I've tried full CPU and full GPU, and I'm on Windows 10. I'm using the latest llama.cpp version with the PR in progress for v2.6.
@ChristianWeyer My internet connection to hf is not fast, so I expect it will take about a week to update them all. But I can upload the F16 GGUF today; I will notify you when it is uploaded.
@Wuzzooy I'm a little confused by what you said. Can I ask what version of the model you are running?
I only brought up v2.6 to say that I'm using the latest llama.cpp with your in-progress v2.6 PR 8967, but the model I use for this test is Llama3-V 2.5. When I ask for the starting balance on the image provided by the OP, I get the right answer, but when I ask for the ending balance, it often answers 3,902.10, as shown in the screenshot, while the demo answers correctly every time. There are performance differences between the demo and the llama.cpp version on other questions too, and I notice more randomness in the llama.cpp version.
This may be normal; the demo we show uses beam search by default, which usually gives more stable answers.
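As far as I know the llama.cpp example does not offer beam search, but you can at least remove the run-to-run randomness by sampling greedily. A sketch reusing the model files and image from the commands above (in llama.cpp a temperature of 0.0 should already pick the most likely token, and --top-k 1 makes that explicit):

./llama-minicpmv-cli -m MiniCPM-Llama3-V-2_5/ggml-model-F16.gguf --mmproj MiniCPM-Llama3-V-2_5/mmproj-model-f16_for_pr.gguf -c 4096 --temp 0.0 --top-k 1 --image ccs.jpg -p "What is the ending balance?"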
The F16 upload takes a veeery long time... 🙂
It is really very difficult for me to upload to HF from the Chinese internet; transfers often fail.
I've tried the 2.6 version and I didn't find any performance issues. I've tried to convert v2.5 with convert_hf_to_gguf.py, but I get an error about the tokenizer:
WARNING:hf-to-gguf:**************************************************************************************
WARNING:hf-to-gguf:** WARNING: The BPE pre-tokenizer was not recognized!
WARNING:hf-to-gguf:** There are 2 possible reasons for this:
WARNING:hf-to-gguf:** - the model has not been added to convert_hf_to_gguf_update.py yet
WARNING:hf-to-gguf:** - the pre-tokenization config has changed upstream
WARNING:hf-to-gguf:** Check your model files and convert_hf_to_gguf_update.py and update them accordingly.
WARNING:hf-to-gguf:** ref: https://github.com/ggerganov/llama.cpp/pull/6920
WARNING:hf-to-gguf:**
WARNING:hf-to-gguf:** chkhsh: 1baddeb572cd9de2a6d36f2ad0c361490bf5447dafca20afbac625e9d37f18a5
WARNING:hf-to-gguf:**************************************************************************************
WARNING:hf-to-gguf:
Traceback (most recent call last):
File "G:\AI\quantizeHFmodel-main\convert_hf_to_gguf.py", line 1457, in set_vocab
self._set_vocab_sentencepiece()
File "G:\AI\quantizeHFmodel-main\convert_hf_to_gguf.py", line 680, in _set_vocab_sentencepiece
tokens, scores, toktypes = self._create_vocab_sentencepiece()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\AI\quantizeHFmodel-main\convert_hf_to_gguf.py", line 697, in _create_vocab_sentencepiece
raise FileNotFoundError(f"File not found: {tokenizer_path}")
FileNotFoundError: File not found: G:\AI\quantizeHFmodel-main\MiniCPM-Llama3-V-2_5\model\tokenizer.model
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "G:\AI\quantizeHFmodel-main\convert_hf_to_gguf.py", line 1460, in set_vocab
self._set_vocab_llama_hf()
File "G:\AI\quantizeHFmodel-main\convert_hf_to_gguf.py", line 772, in _set_vocab_llama_hf
vocab = gguf.LlamaHfVocab(self.dir_model)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\AI\quantizeHFmodel-main\gguf-py\gguf\vocab.py", line 368, in __init__
raise FileNotFoundError('Cannot find Llama BPE tokenizer')
FileNotFoundError: Cannot find Llama BPE tokenizer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "G:\AI\quantizeHFmodel-main\convert_hf_to_gguf.py", line 4065, in <module>
main()
File "G:\AI\quantizeHFmodel-main\convert_hf_to_gguf.py", line 4059, in main
model_instance.write()
File "G:\AI\quantizeHFmodel-main\convert_hf_to_gguf.py", line 388, in write
self.prepare_metadata(vocab_only=False)
File "G:\AI\quantizeHFmodel-main\convert_hf_to_gguf.py", line 381, in prepare_metadata
self.set_vocab()
File "G:\AI\quantizeHFmodel-main\convert_hf_to_gguf.py", line 1463, in set_vocab
self._set_vocab_gpt2()
File "G:\AI\quantizeHFmodel-main\convert_hf_to_gguf.py", line 616, in _set_vocab_gpt2
tokens, toktypes, tokpre = self.get_vocab_base()
^^^^^^^^^^^^^^^^^^^^^
File "G:\AI\quantizeHFmodel-main\convert_hf_to_gguf.py", line 469, in get_vocab_base
tokpre = self.get_vocab_base_pre(tokenizer)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\AI\quantizeHFmodel-main\convert_hf_to_gguf.py", line 607, in get_vocab_base_pre
raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()
I don't have this conversion issue with the 2.6 version.
Finally uploaded successfully today.
@Wuzzooy Maybe you can try this: add 'res = "llama-bpe"' at line 514 of convert_hf_to_gguf.py.
Yes it works when adding this line.
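For anyone hitting the same conversion error: the suggested change goes inside get_vocab_base_pre() in convert_hf_to_gguf.py, alongside the other known pre-tokenizer hash checks, using the chkhsh value printed in the warning above. This is a rough illustration rather than a verbatim patch (the exact line number and surrounding code differ between llama.cpp revisions):

# In convert_hf_to_gguf.py, inside get_vocab_base_pre(), next to the other
# "if chkhsh == ..." checks (around line 514 in the version discussed above):
if chkhsh == "1baddeb572cd9de2a6d36f2ad0c361490bf5447dafca20afbac625e9d37f18a5":
    # MiniCPM-Llama3-V 2.5: mapping this hash to "llama-bpe" is the workaround
    # confirmed to work in this thread
    res = "llama-bpe"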
I tried 2.6 at the weekend. It is very good :-).