fix: apply cache size limit for VisionAttention
Motivation
Enforce an upper bound on the size of the VisionAttention mask cache. This change was originally part of #3203 and has been split out into this PR.
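For illustration, here is a minimal sketch of one way to enforce such a bound with LRU eviction. The names `MASK_CACHE_MAX_ENTRIES` and `get_attention_mask` are hypothetical and not the actual sglang implementation:

```python
from collections import OrderedDict

import torch

# Hypothetical upper bound on cached masks; not the actual sglang constant.
MASK_CACHE_MAX_ENTRIES = 128

# LRU-style cache: the least recently used entry is evicted at the bound,
# so the cache can no longer grow without limit as new shapes arrive.
_mask_cache: "OrderedDict[tuple, torch.Tensor]" = OrderedDict()


def get_attention_mask(cache_key: tuple, build_mask) -> torch.Tensor:
    """Return the mask for cache_key, building and caching it on a miss."""
    if cache_key in _mask_cache:
        _mask_cache.move_to_end(cache_key)  # mark as most recently used
        return _mask_cache[cache_key]
    mask = build_mask()
    if len(_mask_cache) >= MASK_CACHE_MAX_ENTRIES:
        _mask_cache.popitem(last=False)  # evict least recently used entry
    _mask_cache[cache_key] = mask
    return mask
```

Usage would look like `get_attention_mask((seq_len,), lambda: torch.zeros(seq_len, seq_len))`; without an eviction policy like this, every new sequence shape adds a mask tensor that is never freed.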
Modifications
Checklist
- [ ] Format your code according to the Code Formatting with Pre-Commit.
- [ ] Add unit tests as outlined in the Running Unit Tests.
- [ ] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
- [ ] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
- [ ] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
- [ ] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.
ref #3651
@yizhang2077 Will merge it after CI passes.
@mickqian I ran the Qwen2.5-VL-7B model on the latest version with the following command:
```
python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --host 0.0.0.0 --port 8080 --chat-template qwen2-vl --chunked-prefill-size -1 --disable-radix-cache --mm-attention-backend fa3 --attention-backend fa3 --enable-torch-compile --cuda-graph-bs 80 --torch-compile-max-bs 80
```
then benchmarked the server with concurrency=80. After running for some time, the server hit an OOM error.
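For reference, a load pattern like that can be reproduced against the server's OpenAI-compatible endpoint with a sketch like the one below. This is not the benchmark script actually used; the request count is a placeholder, and the prompts are text-only, whereas a real reproduction of this issue would need image inputs to exercise the vision path:

```python
import concurrent.futures

import requests

URL = "http://localhost:8080/v1/chat/completions"
CONCURRENCY = 80
NUM_REQUESTS = 800  # placeholder request count


def send_request(i: int) -> int:
    # Text-only placeholder payload; add image content to hit VisionAttention.
    payload = {
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",
        "messages": [{"role": "user", "content": f"Describe request {i}."}],
        "max_tokens": 128,
    }
    resp = requests.post(URL, json=payload, timeout=600)
    return resp.status_code


# Keep 80 requests in flight at a time, mirroring concurrency=80.
with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    statuses = list(pool.map(send_request, range(NUM_REQUESTS)))

print(f"non-200 responses: {sum(s != 200 for s in statuses)}")
```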