DefTruth
I'm using vllm==0.7.4.dev145+g73e0225ee.
Enabling chunked prefill and CUDA graph may lead to unbalanced VRAM usage.
> Please submit a minimal reproducible code or command.

Run DeepSeek-R1-Distill-Qwen-32B on L20x4:

```bash
nohup python3 -m vllm.entrypoints.openai.api_server \
  --model /workspace/hf_models/DeepSeek-R1-Distill-Qwen-32B \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --max-num-batched-tokens 2048 ...
```
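For reference, a rough offline-API equivalent of the serving command above (a sketch only: the model path, tensor-parallel size, and token limits mirror the flags shown, while `enable_chunked_prefill=True` is assumed from the issue description rather than taken from the truncated command):

```python
# Sketch: offline-API equivalent of the serving command above.
# enable_chunked_prefill is assumed from the report, not from the
# (truncated) command line; enforce_eager=False keeps CUDA graphs on.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/workspace/hf_models/DeepSeek-R1-Distill-Qwen-32B",
    tensor_parallel_size=4,
    max_model_len=32768,
    max_num_batched_tokens=2048,
    enable_chunked_prefill=True,
    enforce_eager=False,
)

print(llm.generate(["Hello"], SamplingParams(max_tokens=16)))
```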
```bash
INFO 03-03 20:36:02 [loader.py:422] Loading weights took 8.80 seconds (VllmWorkerProcess pid=1005428)
INFO 03-03 20:36:02 [loader.py:422] Loading weights took 8.80 seconds (VllmWorkerProcess pid=1005426)
INFO 03-03 20:36:02 [loader.py:422] Loading weights took ...
```
VRAM usage is balanced at the very beginning.
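For anyone trying to reproduce the imbalance, here is a small monitoring sketch (assuming the nvidia-ml-py / pynvml package is installed) that polls per-GPU memory while the server is running, so the drift away from balanced usage can be seen over time:

```python
# Sketch: poll per-GPU memory to watch for imbalance over time.
# Requires nvidia-ml-py (pynvml); run alongside the vLLM server.
import time
import pynvml

pynvml.nvmlInit()
num_gpus = pynvml.nvmlDeviceGetCount()
try:
    while True:
        used = []
        for i in range(num_gpus):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            used.append(mem.used / 2**30)
        print(" | ".join(f"GPU{i}: {u:6.2f} GiB" for i, u in enumerate(used)))
        time.sleep(5)
finally:
    pynvml.nvmlShutdown()
```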
@LucasWilkinson PTAL, thanks~
@LucasWilkinson some tests failed, but they seem unrelated to this PR.
Weird, I don't know why the mamba kernel tests failed:

```bash
FAILED kernels/test_mamba_ssm_ssd.py::test_mamba_chunk_scan_cont_batch[seq_len_chunk_size_cases0-5-8-itype0]
AssertionError: chunk_indices and chunk_offsets should have been set
```
The mamba SSD kernel test failure is related to PR https://github.com/vllm-project/vllm/pull/16623.