Add Qwen2.5VL support

HimariO opened this pull request 8 months ago • 15 comments

Original issue: #11483

Changes

  • Added new GGUF keys for the clip model to support (see the sketch after this list):
    • GLU MLP
    • window attention
    • RMS norm
  • Updated clip.cpp vision model to incorporate these new components.
  • Modified qwen2_vl_surgery.py and convert_hf_to_gguf.py to support the Qwen2.5VL model.
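
As a rough illustration (referenced in the list above) of how the clip loader can read such flags, here is a minimal standalone sketch built on ggml's GGUF C API. The `clip.vision.*` key names are placeholders for illustration only, not necessarily the keys this PR actually adds, and the standalone `gguf.h` header is assumed to be available:

```cpp
#include <cstdint>
#include <cstdio>

#include "gguf.h" // ggml's GGUF C API (assumed available as a standalone header)

// Illustration only: the key names below are hypothetical placeholders.
int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s mmproj.gguf\n", argv[0]);
        return 1;
    }

    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) {
        fprintf(stderr, "failed to load %s\n", argv[1]);
        return 1;
    }

    // Missing keys return -1 from gguf_find_key, so older mmproj files keep
    // loading with the feature simply disabled.
    int64_t id = gguf_find_key(ctx, "clip.vision.use_glu_mlp");   // hypothetical key
    const bool use_glu_mlp  = id >= 0 && gguf_get_val_bool(ctx, id);

    id = gguf_find_key(ctx, "clip.vision.use_rms_norm");          // hypothetical key
    const bool use_rms_norm = id >= 0 && gguf_get_val_bool(ctx, id);

    printf("use_glu_mlp=%d use_rms_norm=%d\n", use_glu_mlp, use_rms_norm);
    gguf_free(ctx);
    return 0;
}
```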

Model Conversion

The only change in the conversion process compared to Qwen2VL is the addition of the --model_type argument when creating the vision encoder GGUF file. (For the rest of the process and how to build llama-qwen2vl-cli, refer to #10361.)

PYTHONPATH=$PYTHONPATH:$(pwd)/gguf-py python3 examples/llava/qwen2_vl_surgery.py "/path/to/model" --data_type fp16 --model_type "qwen2.5vl"

HimariO avatar Mar 15 '25 19:03 HimariO

I have not converted models with the surgery myself, but I can confirm that those uploaded at https://huggingface.co/samgreen/Qwen2.5-VL-7B-Instruct-GGUF are working correctly with your changes.

LostRuins avatar Apr 01 '25 15:04 LostRuins

Waiting for this! 🙌 🙏

thomasht86 avatar Apr 03 '25 12:04 thomasht86

@HimariO I am having trouble with the model output stopping partway through an answer. I am using something like:

llama-qwen2vl-cli -m ./Qwen25-VL/Qwen2.5-VL-7B-Instruct.gguf --mmproj ./Qwen25-VL/qwen25-vll-vision.gguf --image ./test.png -t 12 --threads 12 --ctx-size 128000 --batch-size 32 -j "{}" --ignore-eos -n -1

Do you have any thoughts on what might be the cause?

abalmos avatar Apr 12 '25 20:04 abalmos

Works fine for me. Why are you using --ignore-eos? Also, instead of setting -n to -1, does it still stop prematurely if you set it to a large number?

LostRuins avatar Apr 13 '25 03:04 LostRuins

I am still trying to figure this space out. Some Googling suggested that EOS can sometimes be a problem and that --ignore-eos could help. I now see how that wouldn't help here.

@LostRuins Thanks! You were right: making -n very large allowed the output to finish. I guess -n -1 does not work for this model, @HimariO?

abalmos avatar Apr 13 '25 13:04 abalmos

@abalmos It seems the process_prompt function will set the number of output tokens to 256 if you set it to -1 (or just leave it at the default). And since qwen2vl-cli is based on llava-cli, other models will show this behavior as well.
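
To make the behavior concrete, here is a minimal standalone sketch of the logic described above (a paraphrase, not the exact llava-cli code):

```cpp
#include <cstdio>

// Paraphrased sketch: a negative n_predict (the default, -1) is replaced by a
// hard cap of 256 output tokens, while an explicit positive value is respected.
static int effective_max_tokens(int n_predict) {
    return n_predict < 0 ? 256 : n_predict;
}

int main() {
    printf("-n -1   -> %d tokens\n", effective_max_tokens(-1));   // default: capped at 256
    printf("-n 4096 -> %d tokens\n", effective_max_tokens(4096)); // explicit value is kept
    return 0;
}
```

So passing a large explicit -n is the expected workaround.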

HimariO avatar Apr 13 '25 13:04 HimariO

@HimariO Thanks. The model did produce output even with all the config at default; it would just stop too early. The flags in my first comment were just all the things I tried changing. Testing again, only the -n flag is actually needed for the JSON output I was expecting. Based on your last comment, that is now understood and makes sense.

Everything is working as expected with the defaults and an explicit -n flag. Thanks!

abalmos avatar Apr 13 '25 15:04 abalmos

@ggerganov, I think this PR is ready for review. Please take a look when you have a moment.

HimariO avatar Apr 14 '25 16:04 HimariO

Tested with this model: https://huggingface.co/ByteDance-Seed/UI-TARS-1.5-7B. All steps: conversion of the LLM part (quantized to q5_k_m), conversion of the vision part, and inference.

Results

I had to save the prompt to a file and specify it like:

llama-qwen2vl-cli ... -p "$(cat ./prompt.txt)"

prompt.txt

You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.

## Output Format
```
Thought: ...
Action: ...
```

## Action Space

click(start_box='(x1,y1)')
left_double(start_box='(x1,y1)')
right_single(start_box='(x1,y1)')
drag(start_box='(x1,y1)', end_box='(x3,y3)')
hotkey(key='')
type(content='xxx')
scroll(start_box='(x1,y1)', direction='down or up or right or left')
wait() #Sleep for 5s and take a screenshot to check for any changes.
finished(content='xxx')


## Note
- Use English in `Thought` part.
- Write a small plan and finally summarize your next action (with its target element) in one sentence in `Thought` part.

## User Instruction
Open google and search for Henry Ford previous job title
Scrx

Model answer:

Thought: I've accessed the Google Translate page. To start searching for the previous job title of Henry Ford, I need to enter the text into the left input box. The first step is to click on the input box so that I can type in the keywords.
Action: click(start_box='<|box_start|>(290,286)<|box_end|>')

Thank you, @HimariO ! It's great!

CoruNethron avatar Apr 18 '25 05:04 CoruNethron

For what it's worth, I have been using the qwen2.5vl implementation from this PR and it is working well enough, at least based on the results it generates.

LostRuins avatar Apr 25 '25 15:04 LostRuins

My main concern is making sure this PR doesn't add too much friction to supporting future models (like Qwen 3 or whatever else is about to be released, who knows?). clip.cpp was in very poor shape (cumbersome code, memory leaks, etc.) before I refactored it, so it's better to pay more attention to coding style for each new model.

ngxson avatar Apr 25 '25 16:04 ngxson

@ngxson I've incorporated most of the changes we discussed. I think there isn't much left to update, so it won't take me much time to finish this PR.

HimariO avatar Apr 26 '25 12:04 HimariO

Thanks @HimariO for taking the time. It looks good to me overall; I'll look deeper and push some commits directly to this PR.

One thing that would be nice to add, though: there is a test file here where you can add your pre-quantized model. The current conversion pipeline involves 3 steps (surgery, convert the text model, convert the vision encoder), so I'm not sure I will have time to test this myself. If you have a pre-quantized GGUF, feel free to share it.

In a follow-up PR, I'll try to move qwen2 mmproj conversion into convert_hf_to_gguf.py. Newer models like gemma 3, smolvlm and pixtral are already using this approach.

Btw, @LostRuins could you also verify if this PR is still working correctly?

ngxson avatar Apr 26 '25 13:04 ngxson

Merging this once the CI is green. I also added the test I mentioned in the last comment.

Here is the pre-quantized Qwen2.5-VL 3B:

llama-qwen2vl-cli -hf ggml-org/Qwen2.5-VL-3B-Instruct-GGUF

More quants will be available once the convert_hf_to_gguf.py script supports mmproj conversion.

ngxson avatar Apr 26 '25 21:04 ngxson

I did a quick test with my existing quants but they didn't work - though I see the q2vl surgery file has been changed and I would probably need to reconstruct the mmproj? I will redownload the model and try that again later.

LostRuins avatar Apr 27 '25 05:04 LostRuins

Hello @ngxson , the newest PR is not working correctly for me.

I reconverted https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct with the new surgery script, and the resulting mmproj loaded. However, when trying to perform inference I get a segfault.

I then tried your new quants at https://huggingface.co/ggml-org/Qwen2.5-VL-3B-Instruct-GGUF with the same result, also segfault.

The segfault seems to be happening in ggml_nbytes() at this line: https://github.com/HimariO/llama.cpp.qwen2.5vl/blob/qwen25-vl/examples/llava/clip.cpp#L3272. Looking closer, I think it should not even be in that branch; previously, in @HimariO's version, it was covered by the if (ctx->has_qwen2vl_merger) { check.

https://github.com/HimariO/llama.cpp.qwen2.5vl/blob/53a15d014f2dd0a1409e49da097dba891c629f6e/examples/llava/clip.cpp#L3142

I tried replacing

if (ctx->proj_type == PROJECTOR_TYPE_QWEN2VL) {

with

if (ctx->proj_type == PROJECTOR_TYPE_QWEN2VL || ctx->proj_type == PROJECTOR_TYPE_QWEN25VL) {

but that is not enough: the inference proceeds, but the output is wrong. I think has_qwen2vl_merger was used in multiple places, so some of them need to match against PROJECTOR_TYPE_QWEN25VL as well.
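
To illustrate the pattern, every former has_qwen2vl_merger call site would need something along these lines. This is only a self-contained sketch, not the actual patch (the enum values mirror the PROJECTOR_TYPE_* names in clip.cpp; the helper name is mine):

```cpp
#include <cstdio>

// Illustrative sketch: treat Qwen2VL and Qwen2.5VL as one family wherever the
// old has_qwen2vl_merger flag used to gate the Qwen2VL-specific code path.
enum projector_type {
    PROJECTOR_TYPE_MLP,
    PROJECTOR_TYPE_QWEN2VL,
    PROJECTOR_TYPE_QWEN25VL,
};

static bool is_qwen2vl_family(projector_type t) {
    return t == PROJECTOR_TYPE_QWEN2VL || t == PROJECTOR_TYPE_QWEN25VL;
}

int main() {
    const projector_type t = PROJECTOR_TYPE_QWEN25VL;
    if (is_qwen2vl_family(t)) {
        printf("taking the Qwen2VL merger code path\n");
    }
    return 0;
}
```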

Qwen2VL, gemma and others still work fine.

LostRuins avatar Apr 27 '25 08:04 LostRuins

Also, I think the Qwen2.5VL 3B model is broken for unrelated reasons; the output is completely incoherent, so it's not a good example to compare against.

Anyway @ngxson, I made this fix and it now seems to work with the latest 7B Qwen2.5VL quants. It is possibly overkill, but do take a look: https://github.com/LostRuins/koboldcpp/commit/f8b7ddeac0bf3b6a46b3bd2dd008cd08e32b1f3a. At least in practice it seems to work fine.

For those who want to try older mmproj files without reconverting, this ugly hack will allow you to load them as well: https://github.com/LostRuins/koboldcpp/commit/37060f54da90d3a466c439e5c2063fd718c33f13 (though I'm sure it's out of scope for llama.cpp).

Edit: PR for fix https://github.com/ggml-org/llama.cpp/pull/13133

LostRuins avatar Apr 27 '25 08:04 LostRuins

Please note that if you're distributing GGUF from a WIP PR, it's your responsibility to update it. For the same reason, I don't recommend distributing GGUF publicly before merging, unless the PR reaches a final review state.

ngxson avatar Apr 27 '25 10:04 ngxson

I'm testing with the 32B model and I realized that the text model does not work; it responds with @@@@@@@@@... repeatedly. @LostRuins have you tested with Qwen 2.5 VL 32B? Here is our pre-quant: https://huggingface.co/ggml-org/Qwen2.5-VL-32B-Instruct-GGUF

I'm trying with llama-cli btw, no vision here.

@HimariO Also, I think the 32B model does use fullatt_block_indexes. The reason you see it missing from config.json is that transformers excludes keys from the JSON when they match the default value. I don't know why this only happens with the 32B, but I'm pretty sure that's the case here. If the model didn't use it, it would have been an empty array: fullatt_block_indexes: []

ngxson avatar Apr 30 '25 20:04 ngxson

Hello @ngxson, I just tried the text model alone (https://huggingface.co/ggml-org/Qwen2.5-VL-32B-Instruct-GGUF/blob/main/Qwen2.5-VL-32B-Instruct-Q4_K_M.gguf), loading 40 layers with the Vulkan backend, and it works perfectly fine.

Tried again with CUDA backend, 40 layers, also no issue.

I did not test flash attention.

I am on Windows 10 x86_64, Nvidia RTX 4090 laptop (driver 566.36 so no coopmat2), Intel i9-13980hx.

Did you use the exact same quant download as above? Can you give me a set of launch args that do not work?

LostRuins avatar May 01 '25 09:05 LostRuins

Yes, I used the exact quant above; the command is llama-cli -hf ggml-org/Qwen2.5-VL-32B-Instruct-GGUF:Q4_K_M

I haven't tested CPU-only, but if it works on CUDA and Vulkan, there is probably a problem with the Metal backend.

ngxson avatar May 01 '25 10:05 ngxson

@ngxson The 32B model works on my M2 Studio:

[screenshot]

ggerganov avatar May 02 '25 07:05 ggerganov

This is how it looks on my system (Mac M3 Ultra):

[screenshot]

Note: llamac is my custom bash macro that runs cmake and then the binary in one step.


with -ngl 0 and -DGGML_METAL=OFF:

[screenshot]

ngxson avatar May 02 '25 08:05 ngxson

I can't test on Mac, but I can confirm coherence on CUDA, Vulkan, and CPU.

LostRuins avatar May 02 '25 08:05 LostRuins

@ngxson Interestingly, when I convert the 72B VL, I get extremely high perplexity values for Qwen 2.5 VL 72B Instruct: weirdly, 20 to 70 after BF16 conversion.

danielhanchen avatar May 18 '25 06:05 danielhanchen

@HimariO

D:\work\AI\llama.cpp.qwen2.5vl-master>python examples/llava/qwen2_vl_surgery.py "D:\work\AI\Model\TestModel2" --data_type fp16 --model_type "qwen2.5vl"
usage: qwen2_vl_surgery.py [-h] [--data_type [{fp32,fp16}]] [model_name]
qwen2_vl_surgery.py: error: unrecognized arguments: --model_type qwen2.5vl

Please help.

YuBin8 avatar Jul 15 '25 06:07 YuBin8