llama.cpp
Add Qwen2.5VL support
Original issue: #11483
Changes
- Added new GGUF keys for the clip model to support:
  - GLU MLP,
  - window attention,
  - RMS norm
  (see the sketch after this list)
- Updated the `clip.cpp` vision model to incorporate these new components.
- Modified `qwen2_vl_surgery.py` and `convert_hf_to_gguf.py` to support the Qwen2.5VL model.
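For orientation, the new vision blocks can be expressed with ggml roughly as follows. This is a minimal sketch, not the PR's actual graph code: the function name, the tensor names (`gate`, `up`, `down`), and the epsilon value are placeholders.

```cpp
#include "ggml.h"

// Illustrative sketch of an RMS-normed GLU MLP block as it might appear in a
// ggml build-graph function. All names here are placeholders, not the PR's code.
static ggml_tensor * build_glu_mlp_sketch(
        ggml_context * ctx0, ggml_tensor * cur,
        ggml_tensor * gate, ggml_tensor * up, ggml_tensor * down) {
    cur = ggml_rms_norm(ctx0, cur, 1e-6f);                             // pre-MLP RMS norm (eps is an assumption)
    ggml_tensor * g = ggml_silu(ctx0, ggml_mul_mat(ctx0, gate, cur));  // gated (SiLU) branch
    ggml_tensor * u = ggml_mul_mat(ctx0, up, cur);                     // linear branch
    return ggml_mul_mat(ctx0, down, ggml_mul(ctx0, g, u));             // down-projection
}
```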
Model Conversion
The only change in the conversion process compared to Qwen2VL is the addition of the model_type parameter when creating the vision encoder GGUF file. (For the rest of the process and how to build llama-qwen2vl-cli, refer to #10361.)
PYTHONPATH=$PYTHONPATH:$(pwd)/gguf-py python3 examples/llava/qwen2_vl_surgery.py "/path/to/model" --data_type fp16 --model_type "qwen2.5vl"
- [x] I have read the contributing guidelines
- Self-reported review complexity:
- [ ] Low
- [x] Medium
- [ ] High
I have not converted models with the surgery myself, but I can confirm that those uploaded at https://huggingface.co/samgreen/Qwen2.5-VL-7B-Instruct-GGUF are working correctly with your changes.
Waiting for this! 🙌 🙏
@HimariO I am having trouble with the model output just stopping while only partly done answering. I am using something like:
llama-qwen2vl-cli -m ./Qwen25-VL/Qwen2.5-VL-7B-Instruct.gguf --mmproj ./Qwen25-VL/qwen25-vll-vision.gguf --image ./test.png -t 12 --threads 12 --ctx-size 128000 --batch-size 32 -j "{}" --ignore-eos -n -1
Do you have any thoughts on what might be the cause?
Works fine for me. Why are you using --ignore-eos? Also, instead of setting -n to -1, does it still stop prematurely if you set it to a large number?
I am still trying to figure this space out. Some Googling suggested that EOS can sometimes be a problem and that --ignore-eos could help. I now see how that wouldn't help here.
@LostRuins Thanks! You were right, making -n very large allowed for the output to finish. I guess -n -1 does not work for this model @HimariO ?
@abalmos It seems like the process_prompt function will set the number of output tokens to 256 if you set it to -1 (or just leave it at the default). And since qwen2vl-cli is based on llava-cli, other models will also show this behavior.
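For context, the behavior described above amounts to a fallback along these lines (an illustrative sketch, not the actual llava-cli / qwen2vl-cli source; the function and variable names are placeholders):

```cpp
// Sketch of the default token budget described above: when -n is -1 (or left
// unset), fall back to a fixed cap instead of generating until EOS.
static int resolve_max_new_tokens(int n_predict_arg) {
    const int DEFAULT_BUDGET = 256;  // the 256-token fallback mentioned above
    return n_predict_arg <= 0 ? DEFAULT_BUDGET : n_predict_arg;
}
```

Passing an explicit large -n, as suggested above, sidesteps that cap.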
@HimariO Thanks. The model did produce output, even with all the config at default; it would just stop too early. The flags in my first comment were just all the things I tried changing. Testing again, only the -n flag is actually needed for the JSON output I was expecting. Based on your last comment, that seems to be understood and makes sense.
Everything is working as expected with the default and a -n flag. Thanks!
@ggerganov, I think this PR is ready for review. Please take a look when you have a moment.
Tested with this model: https://huggingface.co/ByteDance-Seed/UI-TARS-1.5-7B All steps: conversion of the LLM part (quantized to q5_k_m), conversion of the vision part, and inference.
Results
I had to save prompt to a file, and specify it like:
llama-qwen2vl-cli ... -p "$(cat ./prompt.txt)"
prompt.txt
You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
## Output Format
```
Thought: ...
Action: ...
```
## Action Space
click(start_box='(x1,y1)')
left_double(start_box='(x1,y1)')
right_single(start_box='(x1,y1)')
drag(start_box='(x1,y1)', end_box='(x3,y3)')
hotkey(key='')
type(content='xxx')
scroll(start_box='(x1,y1)', direction='down or up or right or left')
wait() #Sleep for 5s and take a screenshot to check for any changes.
finished(content='xxx')
## Note
- Use English in `Thought` part.
- Write a small plan and finally summarize your next action (with its target element) in one sentence in `Thought` part.
## User Instruction
Open google and search for Henry Ford previous job title
Model answer:
Thought: I've accessed the Google Translate page. To start searching for the previous job title of Henry Ford, I need to enter the text into the left input box. The first step is to click on the input box so that I can type in the keywords.
Action: click(start_box='<|box_start|>(290,286)<|box_end|>')
Thank you, @HimariO ! It's great!
For what it's worth, I have been using the qwen2.5vl implementation from this PR and it is working well enough, at least based on the results it generates.
My main concern is to make sure this PR doesn't add too much resistance to adding future models (like Qwen 3 or something else about to be released, who knows?). clip.cpp was in very poor shape (cumbersome code, memory leaks, etc.) before I refactored it, so it's better to pay more attention to the coding style for each new model.
@ngxson I've incorporated most of the changes we discussed. I think there aren't too many things left to update, so it won't take me much time to finish this PR.
Thanks @HimariO for taking your time. It looks good to me overall, I'll look deeper and push some commits directly to this PR.
One thing would be nice to add though: there is a test file here to which you can add your pre-quantized model. The current conversion pipeline involves 3 steps (surgery, convert the text model, convert the vision encoder), so I'm not sure I will have time to test this myself. If you have a pre-quantized GGUF, feel free to share it.
In a follow-up PR, I'll try to move qwen2 mmproj conversion into convert_hf_to_gguf.py. Newer models like gemma 3, smolvlm and pixtral are already using this approach.
Btw, @LostRuins could you also verify if this PR is still working correctly?
Merging this once the CI is green. I also added the test I mentioned in the last comment.
Here is the pre-quantized Qwen2.5-VL 3B:
llama-qwen2vl-cli -hf ggml-org/Qwen2.5-VL-3B-Instruct-GGUF
More quants will be available once the convert_hf_to_gguf.py script supports the mmproj conversion.
I did a quick test with my existing quants but they didn't work. Though I see the q2vl surgery file has been changed, so I would probably need to reconstruct the mmproj? I will redownload the model and try again later.
Hello @ngxson, the newest PR is not working correctly for me.
I reconverted https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct with the new surgery script, and the resulting mmproj loaded. However, when trying to perform inference I get a segfault.
I then tried your new quants at https://huggingface.co/ggml-org/Qwen2.5-VL-3B-Instruct-GGUF with the same result, also segfault.
The segfault seems to be happening at ggml_nbytes() on this line: https://github.com/HimariO/llama.cpp.qwen2.5vl/blob/qwen25-vl/examples/llava/clip.cpp#L3272. Looking closer, I think it should not even be in that branch; previously, in @HimariO's version, it was guarded by the if (ctx->has_qwen2vl_merger) check.
https://github.com/HimariO/llama.cpp.qwen2.5vl/blob/53a15d014f2dd0a1409e49da097dba891c629f6e/examples/llava/clip.cpp#L3142
I tried replacing
if (ctx->proj_type == PROJECTOR_TYPE_QWEN2VL) {
with
if (ctx->proj_type == PROJECTOR_TYPE_QWEN2VL || ctx->proj_type == PROJECTOR_TYPE_QWEN25VL) {
but that is not enough. The inference proceeds, but the output is wrong. I think has_qwen2vl_merger was used in multiple places, so some of them need to match against PROJECTOR_TYPE_QWEN25VL as well.
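For illustration, a small helper along these lines would capture the idea (just a sketch, not the exact code in this PR or in my fork):

```cpp
// Sketch: treat both Qwen2-VL and Qwen2.5-VL projector types as "merger-style"
// everywhere the old has_qwen2vl_merger flag used to be checked.
static bool clip_is_qwen_merger(const clip_ctx * ctx) {
    return ctx->proj_type == PROJECTOR_TYPE_QWEN2VL
        || ctx->proj_type == PROJECTOR_TYPE_QWEN25VL;
}
```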
Qwen2VL, gemma and others still work fine.
Also, I think the Qwen2.5VL 3B model is broken for unrelated reasons. The output is completely incoherent, so it's not a good example to compare against.
Anyway @ngxson, I did this fix and it seems to work now with the latest 7B Qwen2.5VL quants. It is possibly overkill, but do take a look: https://github.com/LostRuins/koboldcpp/commit/f8b7ddeac0bf3b6a46b3bd2dd008cd08e32b1f3a. At least in practice it seems to work fine.
For those who want to try an older mmproj without reconverting, this ugly hack will allow you to load it as well: https://github.com/LostRuins/koboldcpp/commit/37060f54da90d3a466c439e5c2063fd718c33f13 (though I'm sure it's out of scope for llama.cpp).
Edit: PR for fix https://github.com/ggml-org/llama.cpp/pull/13133
Please note that if you're distributing GGUFs from a WIP PR, it's your responsibility to update them. For the same reason, I don't recommend distributing GGUFs publicly before the PR is merged, unless it has reached a final review state.
I'm testing with the 32B model and I realized that the text model does not work; it responds with @@@@@@@@@... repeatedly. @LostRuins have you tested with Qwen 2.5 VL 32B? Here is our pre-quant: https://huggingface.co/ggml-org/Qwen2.5-VL-32B-Instruct-GGUF
I'm trying with llama-cli btw, no vision here.
@HimariO Also, I think the 32B model does use fullatt_block_indexes. The reason you see it missing from config.json is that transformers excludes keys from the JSON if they are the same as the default value. I don't know why it only happens with 32B, but I'm pretty sure that's the case here. If the model didn't use it, it would have been an empty array: fullatt_block_indexes: []
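One way to guard against this kind of omission on the loading side would be something like the sketch below. This is not the PR's actual code: the GGUF key name and the default index values are assumptions and should be checked against the upstream config.

```cpp
#include <vector>
#include "gguf.h"  // gguf_* API (header name may differ by tree)

// Sketch: if the conversion omitted the full-attention block indexes (because
// they matched the upstream default), fall back to that default instead of
// treating the model as having no full-attention blocks.
static std::vector<int32_t> load_fullatt_block_indexes(const gguf_context * ctx_gguf) {
    std::vector<int32_t> result = {7, 15, 23, 31};  // assumed upstream default
    const int64_t kid = gguf_find_key(ctx_gguf, "clip.vision.fullatt_block_indexes");  // hypothetical key name
    if (kid >= 0) {
        const int32_t * data = (const int32_t *) gguf_get_arr_data(ctx_gguf, kid);
        result.assign(data, data + gguf_get_arr_n(ctx_gguf, kid));
    }
    return result;
}
```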
Hello @ngxson , I just tried the text model alone, https://huggingface.co/ggml-org/Qwen2.5-VL-32B-Instruct-GGUF/blob/main/Qwen2.5-VL-32B-Instruct-Q4_K_M.gguf, loaded 40 layers with Vulkan backend. And it works perfectly fine.
Tried again with CUDA backend, 40 layers, also no issue.
I did not test flash attention.
I am on Windows 10 x86_64, Nvidia RTX 4090 laptop (driver 566.36 so no coopmat2), Intel i9-13980hx.
Did you use the exact same quant download as above? Can you give me a set of launch args that do not work?
Yes, I use the exact quant above, the command is llama-cli -hf ggml-org/Qwen2.5-VL-32B-Instruct-GGUF:Q4_K_M
I haven't tested CPU-only, but if it works on CUDA and Vulkan, there is probably a problem with the Metal backend.
@ngxson The 32B model works on my M2 Studio:
This is how it looks on my system (mac M3 ultra)
Note: llamac is my custom bash macro that runs cmake and then the binary in one step
with -ngl 0 and -DGGML_METAL=OFF:
I can't test on Mac, but I can confirm coherence on CUDA, Vulkan, and CPU.
@ngxson Interestingly, when I convert the 72B VL model, I get extremely high perplexity values for Qwen 2.5 VL 72B Instruct: weirdly, around 20 to 70 after BF16 conversion.
@HimariO
D:\work\AI\llama.cpp.qwen2.5vl-master>python examples/llava/qwen2_vl_surgery.py "D:\work\AI\Model\TestModel2" --data_type fp16 --model_type "qwen2.5vl"
usage: qwen2_vl_surgery.py [-h] [--data_type [{fp32,fp16}]] [model_name]
qwen2_vl_surgery.py: error: unrecognized arguments: --model_type qwen2.5vl
Please help.