[Feature] VLM model TP support
Motivation
Support tensor parallelism (TP) for Qwen2.5-VL. GPU memory usage is 78.62 GB at TP=1 and 43.56 GB at TP=4.
Modifications
- Add `qwen2_5_vl.py` for the target model.
- Add `QKVParallelLinear` to `linear.py`, because the `Qwen2_5_VLVisionAttention` class needs it (a minimal sketch of the idea follows this list).
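For context, the sketch below shows the basic idea behind a fused, column-parallel QKV projection. It is a hypothetical illustration only: the class name `NaiveQKVParallelLinear`, its signature, and the example dimensions are assumptions, not the actual `QKVParallelLinear` implementation added to `linear.py` in this PR.

```python
import torch
import torch.nn as nn


class NaiveQKVParallelLinear(nn.Module):
    """Illustrative stand-in for a TP-sharded fused QKV projection.

    The Q, K, and V projections are fused into one matmul and split
    column-wise across tensor-parallel ranks: each rank owns
    num_heads // tp_size heads and computes only its slice of Q/K/V,
    so no communication is needed until the attention output projection.
    """

    def __init__(self, hidden_size: int, num_heads: int, head_dim: int, tp_size: int):
        super().__init__()
        assert num_heads % tp_size == 0, "heads must shard evenly across TP ranks"
        self.num_local_heads = num_heads // tp_size
        local_dim = self.num_local_heads * head_dim
        # 3 * local_dim output features: this rank's Q, K, and V slices, fused.
        self.proj = nn.Linear(hidden_size, 3 * local_dim, bias=True)

    def forward(self, x: torch.Tensor):
        qkv = self.proj(x)
        # Split the fused projection back into this rank's Q, K, V slices.
        return qkv.chunk(3, dim=-1)


# Example (hypothetical dimensions): 16 vision heads sharded over tp4
# leaves 4 local heads per rank.
layer = NaiveQKVParallelLinear(hidden_size=1280, num_heads=16, head_dim=80, tp_size=4)
q, k, v = layer(torch.randn(2, 10, 1280))  # each is (2, 10, 4 * 80)
```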
Related Issues
https://github.com/sgl-project/SpecForge/issues/166
Pending TODOs
- Accuracy test.
- Support TP=8: `num_attention_heads` in config.json is not divisible by 8, so the heads cannot be sharded evenly (see the sketch after this list).
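To make the TP=8 gap concrete, here is a small hypothetical check of the sharding rule; the head count 28 is only an example value, not taken from this PR (read the real number from the model's config.json).

```python
# TP shards attention heads evenly, so the head count in config.json
# must be divisible by the TP degree.
num_attention_heads = 28  # example value; read the real one from config.json

for tp_size in (1, 2, 4, 8):
    divisible = num_attention_heads % tp_size == 0
    print(f"tp{tp_size}: {'ok' if divisible else 'not divisible'}")
# With 28 heads, tp8 fails (28 % 8 == 4); supporting it would need head
# padding or uneven head sharding.
```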
Checklist
- [ ] Format your code according to the Code Formatting with Pre-Commit.
- [ ] Add unit tests as outlined in the Running Unit Tests.
- [ ] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
- [ ] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
- [ ] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
- [ ] Please feel free to join our Slack channel at https://sgl-fru7574.slack.com/archives/C09784E3EN6 to discuss your PR.
@FrankLeeeee Hi, could you help review this VLM PR?
@KerwinKai Is it working properly now?