[llamacpp] - Quality of F16 GGUF is (still) worse than the online demo
Start Date
No response
Implementation PR
Reference Issues
Summary
@tc-mb
Today I am using MiniCPM-Llama3-V 2.5 with the recently released official support in llama.cpp: https://github.com/ggerganov/llama.cpp/blob/master/examples/llava/README-minicpmv2.5.md
I downloaded these files:
- https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf/blob/main/ggml-model-BF16.gguf
- https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf/blob/main/mmproj-model-f16_for_pr.gguf
Everything is running on an Apple MacBook Pro with an M3 Max and 128 GB RAM.
I am using the attached sample image to do OCR tasks.
The 'Starting balance' is reported as '5,902.10', which is wrong. The online demo (https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5) reports it correctly as '3,902.10'.
Basic Example
./llama-minicpmv-cli -m MiniCPM-Llama3-V-2_5/ggml-model-F16.gguf --mmproj MiniCPM-Llama3-V-2_5/mmproj-model-f16_for_pr.gguf -c 4096 --temp 0.0 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image ccs.jpg -i
Drawbacks
Unresolved questions
Hi, sorry for the late reply.
Most of the GGUFs in this repository are old versions now. Although MiniCPM-V 2.5 has been merged into official llama.cpp, we have not yet been able to update all of these GGUFs.
Can you try this GGUF instead? I got the correct result when I tested it locally: https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf/blob/main/ggml-model-Q4_K_M_for_pr.gguf
If there are still accuracy problems, please send me the detailed steps you used and I will continue to help.
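If it helps, one way to fetch that file from the command line (a sketch assuming the huggingface-cli tool from the huggingface_hub package is installed; the repository and file name are the ones linked above):

pip install -U "huggingface_hub[cli]"
huggingface-cli download openbmb/MiniCPM-Llama3-V-2_5-gguf ggml-model-Q4_K_M_for_pr.gguf --local-dir MiniCPM-Llama3-V-2_5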
Hey @tc-mb!
Thanks for the updated instructions.
I just tried the linked GGUF above with:
./llama-minicpmv-cli -m "MiniCPM-Llama3-V-2_5/ggml-model-Q4_K_M_for_pr.gguf" --mmproj "MiniCPM-Llama3-V-2_5/mmproj-model-f16_for_pr.gguf" -c 4096 --temp 0.0 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image ccs.jpg -i
It gave the same wrong result for the 'starting balance'.
Any more ideas what I could do? Thanks!
I just tested the v2.5 model with both a single question and -i mode, and got the correct value. Can I ask which branch you used for the test? I'm afraid you might still be using old branches or old .o files.
I am using the official llama.cpp release. Latest commit from today.
Just completely deleted the clone, re-cloned llama.cpp, rebuilt everything. Same result... 🤔
Are you also running on Apple/MPS/Metal @tc-mb ?
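For reference, a from-scratch rebuild like the one described above looks roughly like this (a sketch assuming the CMake build; on Apple Silicon, Metal should be enabled by default, and the example binaries end up under build/bin):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# configure and build everything, including the llava/minicpmv examples
cmake -B build
cmake --build build --config Release -j 8
./build/bin/llama-minicpmv-cli --help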
Platform: Windows. Version: llama-b3573-bin-win-cuda-cu12.2.0-x64
llama-minicpmv-cli -m C:/AI/llamaCPP/models/v2-5/ggml-model-F16.gguf --mmproj C:/AI/llamaCPP/models/v2-5/mmproj-model-f16.gguf -ngl 50 -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image C:/AI/llamaCPP/image/credit-card-statement.jpg -i
I'm getting good results with the model above, just , became .
Well, I still use my fork of llama.cpp. I will try official llama.cpp now. I use a MacBook Pro with an M2.
Ah, OK. I thought that everything 'good' was already merged into official llama.cpp 🙂.
Are you using the same models and command line as I do?
Nice!
So, something has to be wrong here...
Seems you are even using the old .gguf files - and it still works for you.
@ChristianWeyer Hi, I also verified the behavior on the official branch, and there seems to be no problem across multiple tests.
I had previously been testing with the branch submitted in the official PR, and was simply too lazy to clone the official code and build it again. After all, the code merged upstream has passed the accuracy check.
I am also a little confused about the problem you are facing.
Could you please post a complete log of the command you ran? Maybe that will help me figure out what the problem is.
Which complete log are you referring to?
./llama-minicpmv-cli -m MiniCPM-Llama3-V-2_5/ggml-model-F16.gguf --mmproj MiniCPM-Llama3-V-2_5/mmproj-model-f16_for_pr.gguf -c 4096 --temp 0.0 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image ccs.jpg -p "what is the starting balance"
You can execute this command once and send me all of the output from the command line. The log will be a little long, but it may help me find the source of the problem.
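One way to capture the complete output in a single file while still seeing it on screen (a sketch assuming a POSIX shell; run.log is just a placeholder file name, and the command is the one from above):

./llama-minicpmv-cli -m MiniCPM-Llama3-V-2_5/ggml-model-F16.gguf --mmproj MiniCPM-Llama3-V-2_5/mmproj-model-f16_for_pr.gguf -c 4096 --temp 0.0 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image ccs.jpg -p "what is the starting balance" 2>&1 | tee run.log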
Interesting:
- I executed it 6 times in "-p" mode - it always got it right.
- Then I tried "-i" mode several times - and it got it right.
- Then I tried "-i" and asked it just "starting balance" - it answered: 0000000000000000
However, I currently cannot reproduce the answer with 5,902.10... 🤷🏼♂️
I'm glad you got the right result, although I'm not sure what the problem was. Maybe asking with only a single word confuses the model.
Yeah. Something is really flaky.
Will you also update the other GGUF files soon? I want to use the F16 one. Thanks!
I've got the starting balance right, but I'm still getting performance differences compared to the demo when I try other questions.
I see more randomness via llama.cpp than the demo, even with a low temperature of 0.1.
Sometimes it gets things right and sometimes not on the same question.
I've tried full CPU and full GPU, and I'm on Windows 10. I'm using the latest llama.cpp version with the PR in progress for v2.6.
@ChristianWeyer My internet connection to hf is not fast, so I expect it will take about a week to update them all. But I can upload the F16 GGUF today; I will notify you when it is uploaded.
@Wuzzooy I'm a little confused by what you said. Can I ask what version of the model you are running?
I only brought up v2.6 to say that I'm using the latest llama.cpp with your in-progress v2.6 PR 8967, but the model I use for this test is Llama3-V 2.5. When I ask for the starting balance on the image provided by the OP, I get the right answer, but when I ask for the ending balance, it often answers 3,902.10, as shown in the screenshot, while the demo answers correctly every time. There are performance differences between the demo and the llama.cpp version on other questions too, and I notice more randomness in the llama.cpp version.
This may be normal; the demo we show uses beam search by default, which usually gives more stable answers.
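As far as I know the llama.cpp example does not offer beam search, but you can at least remove the run-to-run randomness by sampling greedily. A sketch reusing the model files and image from the commands above (in llama.cpp a temperature of 0.0 should already pick the most likely token, and --top-k 1 makes that explicit):

./llama-minicpmv-cli -m MiniCPM-Llama3-V-2_5/ggml-model-F16.gguf --mmproj MiniCPM-Llama3-V-2_5/mmproj-model-f16_for_pr.gguf -c 4096 --temp 0.0 --top-k 1 --image ccs.jpg -p "What is the ending balance?"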
The F16 upload takes a veeery long time... 🙂
It is really very difficult for me to upload to HF from the Chinese internet; transfers often fail.
I've tried the 2.6 version and I didn't find any performance issues. I've tried to convert v2.5 with convert_hf_to_gguf.py, but I get an error about the tokenizer:
WARNING:hf-to-gguf:**************************************************************************************
WARNING:hf-to-gguf:** WARNING: The BPE pre-tokenizer was not recognized!
WARNING:hf-to-gguf:** There are 2 possible reasons for this:
WARNING:hf-to-gguf:** - the model has not been added to convert_hf_to_gguf_update.py yet
WARNING:hf-to-gguf:** - the pre-tokenization config has changed upstream
WARNING:hf-to-gguf:** Check your model files and convert_hf_to_gguf_update.py and update them accordingly.
WARNING:hf-to-gguf:** ref: https://github.com/ggerganov/llama.cpp/pull/6920
WARNING:hf-to-gguf:**
WARNING:hf-to-gguf:** chkhsh: 1baddeb572cd9de2a6d36f2ad0c361490bf5447dafca20afbac625e9d37f18a5
WARNING:hf-to-gguf:**************************************************************************************
WARNING:hf-to-gguf:
Traceback (most recent call last):
File "G:\AI\quantizeHFmodel-main\convert_hf_to_gguf.py", line 1457, in set_vocab
self._set_vocab_sentencepiece()
File "G:\AI\quantizeHFmodel-main\convert_hf_to_gguf.py", line 680, in _set_vocab_sentencepiece
tokens, scores, toktypes = self._create_vocab_sentencepiece()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\AI\quantizeHFmodel-main\convert_hf_to_gguf.py", line 697, in _create_vocab_sentencepiece
raise FileNotFoundError(f"File not found: {tokenizer_path}")
FileNotFoundError: File not found: G:\AI\quantizeHFmodel-main\MiniCPM-Llama3-V-2_5\model\tokenizer.model
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "G:\AI\quantizeHFmodel-main\convert_hf_to_gguf.py", line 1460, in set_vocab
self._set_vocab_llama_hf()
File "G:\AI\quantizeHFmodel-main\convert_hf_to_gguf.py", line 772, in _set_vocab_llama_hf
vocab = gguf.LlamaHfVocab(self.dir_model)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\AI\quantizeHFmodel-main\gguf-py\gguf\vocab.py", line 368, in __init__
raise FileNotFoundError('Cannot find Llama BPE tokenizer')
FileNotFoundError: Cannot find Llama BPE tokenizer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "G:\AI\quantizeHFmodel-main\convert_hf_to_gguf.py", line 4065, in <module>
main()
File "G:\AI\quantizeHFmodel-main\convert_hf_to_gguf.py", line 4059, in main
model_instance.write()
File "G:\AI\quantizeHFmodel-main\convert_hf_to_gguf.py", line 388, in write
self.prepare_metadata(vocab_only=False)
File "G:\AI\quantizeHFmodel-main\convert_hf_to_gguf.py", line 381, in prepare_metadata
self.set_vocab()
File "G:\AI\quantizeHFmodel-main\convert_hf_to_gguf.py", line 1463, in set_vocab
self._set_vocab_gpt2()
File "G:\AI\quantizeHFmodel-main\convert_hf_to_gguf.py", line 616, in _set_vocab_gpt2
tokens, toktypes, tokpre = self.get_vocab_base()
^^^^^^^^^^^^^^^^^^^^^
File "G:\AI\quantizeHFmodel-main\convert_hf_to_gguf.py", line 469, in get_vocab_base
tokpre = self.get_vocab_base_pre(tokenizer)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\AI\quantizeHFmodel-main\convert_hf_to_gguf.py", line 607, in get_vocab_base_pre
raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()
I don't have this conversion issue with the 2.6 version.
Finally uploaded successfully today.
@Wuzzooy Maybe you can try this: add 'res = "llama-bpe"' at line 514 of convert_hf_to_gguf.py.
Yes it works when adding this line.
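For anyone hitting the same conversion error: the suggested change goes inside get_vocab_base_pre() in convert_hf_to_gguf.py, alongside the other known pre-tokenizer hash checks, using the chkhsh value printed in the warning above. This is a rough illustration rather than a verbatim patch (the exact line number and surrounding code differ between llama.cpp revisions):

# In convert_hf_to_gguf.py, inside get_vocab_base_pre(), next to the other
# "if chkhsh == ..." checks (around line 514 in the version discussed above):
if chkhsh == "1baddeb572cd9de2a6d36f2ad0c361490bf5447dafca20afbac625e9d37f18a5":
    # MiniCPM-Llama3-V 2.5: mapping this hash to "llama-bpe" is the workaround
    # confirmed to work in this thread
    res = "llama-bpe"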
I tried 2.6 at the weekend. It is very good :-).