fix: apply cache size limit for VisionAttention
Motivation
Enforce an upper bound on the size of the VisionAttention mask cache. This change was originally part of #3203 and has been split out into this PR.
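For illustration, here is a minimal sketch of one way to enforce such a bound with LRU eviction. The names `MASK_CACHE_MAX_ENTRIES` and `get_attention_mask` are hypothetical and not the actual sglang implementation:

```python
from collections import OrderedDict

import torch

# Hypothetical upper bound on cached masks; not the actual sglang constant.
MASK_CACHE_MAX_ENTRIES = 128

# LRU-style cache: the least recently used entry is evicted at the bound,
# so the cache can no longer grow without limit as new shapes arrive.
_mask_cache: "OrderedDict[tuple, torch.Tensor]" = OrderedDict()


def get_attention_mask(cache_key: tuple, build_mask) -> torch.Tensor:
    """Return the mask for cache_key, building and caching it on a miss."""
    if cache_key in _mask_cache:
        _mask_cache.move_to_end(cache_key)  # mark as most recently used
        return _mask_cache[cache_key]
    mask = build_mask()
    if len(_mask_cache) >= MASK_CACHE_MAX_ENTRIES:
        _mask_cache.popitem(last=False)  # evict least recently used entry
    _mask_cache[cache_key] = mask
    return mask
```

Usage would look like `get_attention_mask((seq_len,), lambda: torch.zeros(seq_len, seq_len))`; without an eviction policy like this, every new sequence shape adds a mask tensor that is never freed.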
Modifications
Checklist
- [ ] Format your code according to the Code Formatting with Pre-Commit.
- [ ] Add unit tests as outlined in the Running Unit Tests.
- [ ] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
- [ ] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
- [ ] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
- [ ] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.
ref #3651
@yizhang2077 Will merge it after CI passes.
@mickqian I ran the Qwen2.5-VL-7B model on the latest version with the following command:
```
python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --host 0.0.0.0 --port 8080 --chat-template qwen2-vl --chunked-prefill-size -1 --disable-radix-cache --mm-attention-backend fa3 --attention-backend fa3 --enable-torch-compile --cuda-graph-bs 80 --torch-compile-max-bs 80
```
then benchmarked the server with concurrency=80. After running for some time, the server hit an OOM error.
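For reference, a load pattern like that can be reproduced against the server's OpenAI-compatible endpoint with a sketch like the one below. This is not the benchmark script actually used; the request count is a placeholder, and the prompts are text-only, whereas a real reproduction of this issue would need image inputs to exercise the vision path:

```python
import concurrent.futures

import requests

URL = "http://localhost:8080/v1/chat/completions"
CONCURRENCY = 80
NUM_REQUESTS = 800  # placeholder request count


def send_request(i: int) -> int:
    # Text-only placeholder payload; add image content to hit VisionAttention.
    payload = {
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",
        "messages": [{"role": "user", "content": f"Describe request {i}."}],
        "max_tokens": 128,
    }
    resp = requests.post(URL, json=payload, timeout=600)
    return resp.status_code


# Keep 80 requests in flight at a time, mirroring concurrency=80.
with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    statuses = list(pool.map(send_request, range(NUM_REQUESTS)))

print(f"non-200 responses: {sum(s != 200 for s in statuses)}")
```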