[Feature] VLM model TP support
Motivation
Support tensor parallelism (TP) for Qwen2.5-VL. GPU memory usage is 78.62 GB at TP=1 and 43.56 GB at TP=4.
Modifications
- Add `qwen2_5_vl.py` for the target model.
- Add `QKVParallelLinear` to `linear.py`, because the `Qwen2_5_VLVisionAttention` class needs it (a minimal sketch of the idea follows this list).
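For context, the sketch below shows the basic idea behind a fused, column-parallel QKV projection. It is a hypothetical illustration only: the class name `NaiveQKVParallelLinear`, its signature, and the example dimensions are assumptions, not the actual `QKVParallelLinear` implementation added to `linear.py` in this PR.

```python
import torch
import torch.nn as nn


class NaiveQKVParallelLinear(nn.Module):
    """Illustrative stand-in for a TP-sharded fused QKV projection.

    The Q, K, and V projections are fused into one matmul and split
    column-wise across tensor-parallel ranks: each rank owns
    num_heads // tp_size heads and computes only its slice of Q/K/V,
    so no communication is needed until the attention output projection.
    """

    def __init__(self, hidden_size: int, num_heads: int, head_dim: int, tp_size: int):
        super().__init__()
        assert num_heads % tp_size == 0, "heads must shard evenly across TP ranks"
        self.num_local_heads = num_heads // tp_size
        local_dim = self.num_local_heads * head_dim
        # 3 * local_dim output features: this rank's Q, K, and V slices, fused.
        self.proj = nn.Linear(hidden_size, 3 * local_dim, bias=True)

    def forward(self, x: torch.Tensor):
        qkv = self.proj(x)
        # Split the fused projection back into this rank's Q, K, V slices.
        return qkv.chunk(3, dim=-1)


# Example (hypothetical dimensions): 16 vision heads sharded over tp4
# leaves 4 local heads per rank.
layer = NaiveQKVParallelLinear(hidden_size=1280, num_heads=16, head_dim=80, tp_size=4)
q, k, v = layer(torch.randn(2, 10, 1280))  # each is (2, 10, 4 * 80)
```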
Related Issues
https://github.com/sgl-project/SpecForge/issues/166
Pending TODOs
- Accuracy test.
- Support TP=8: `num_attention_heads` in config.json is not divisible by 8, so the heads cannot be sharded evenly (see the sketch after this list).
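To make the TP=8 gap concrete, here is a small hypothetical check of the sharding rule; the head count 28 is only an example value, not taken from this PR (read the real number from the model's config.json).

```python
# TP shards attention heads evenly, so the head count in config.json
# must be divisible by the TP degree.
num_attention_heads = 28  # example value; read the real one from config.json

for tp_size in (1, 2, 4, 8):
    divisible = num_attention_heads % tp_size == 0
    print(f"tp{tp_size}: {'ok' if divisible else 'not divisible'}")
# With 28 heads, tp8 fails (28 % 8 == 4); supporting it would need head
# padding or uneven head sharding.
```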
Checklist
- [ ] Format your code according to the Code Formatting with Pre-Commit.
- [ ] Add unit tests as outlined in the Running Unit Tests.
- [ ] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
- [ ] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
- [ ] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
- [ ] Please feel free to join our Slack channel at https://sgl-fru7574.slack.com/archives/C09784E3EN6 to discuss your PR.
@FrankLeeeee Hi, could you help review this VLM PR?
@KerwinKai Is it working properly now?