DefTruth


Enabling chunked prefill and CUDA graph may lead to unbalanced VRAM usage.
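For context, chunked prefill and CUDA graph capture are controlled by separate vLLM server options; a minimal sketch of a launch that enables both (CUDA graphs are on unless `--enforce-eager` is passed) might look like the following. The model path is a placeholder, not from the report above.

```shell
python3 -m vllm.entrypoints.openai.api_server \
  --model /path/to/model \
  --tensor-parallel-size 4 \
  --enable-chunked-prefill    # chunked prefill on; omit --enforce-eager so CUDA graphs stay enabled
```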

> Please submit a minimal reproducible code or command.

Run DeepSeek-R1-Distill-Qwen-32B on L20x4:

```bash
nohup python3 -m vllm.entrypoints.openai.api_server \
  --model /workspace/hf_models/DeepSeek-R1-Distill-Qwen-32B \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --max-num-batched-tokens 2048...
```

```bash
INFO 03-03 20:36:02 [loader.py:422] Loading weights took 8.80 seconds
(VllmWorkerProcess pid=1005428) INFO 03-03 20:36:02 [loader.py:422] Loading weights took 8.80 seconds
(VllmWorkerProcess pid=1005426) INFO 03-03 20:36:02 [loader.py:422] Loading weights took...
```
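To make the "unbalanced VRAM usage" claim concrete, a small helper can compute the imbalance across per-GPU memory readings (e.g. the `memory.used` column from `nvidia-smi`). This is an illustrative sketch, not part of vLLM; the function name and the sample values are hypothetical.

```python
def vram_imbalance(used_mib: list[int]) -> float:
    """Return the relative spread (max - min) / max of per-GPU memory usage.

    0.0 means perfectly balanced; values near 1.0 mean one GPU is
    nearly idle while another is full.
    """
    if not used_mib or max(used_mib) == 0:
        return 0.0
    return (max(used_mib) - min(used_mib)) / max(used_mib)


# Hypothetical readings from 4 GPUs, in MiB: one rank using half the VRAM of the others.
print(vram_imbalance([40000, 40000, 40000, 20000]))  # → 0.5
```

A ratio well above zero across tensor-parallel ranks (which should otherwise hold near-identical shards) is a quick signal that something like the issue described above is occurring.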

@LucasWilkinson PTAL, thanks~

@LucasWilkinson some tests failed, but they seem unrelated to this PR.

Weird, I don't know why the mamba kernel tests failed:

```bash
FAILED kernels/test_mamba_ssm_ssd.py::test_mamba_chunk_scan_cont_batch[seq_len_chunk_size_cases0-5-8-itype0]
AssertionError: chunk_indices and chunk_offsets should have been set
```

The mamba SSD kernel test failure is related to PR https://github.com/vllm-project/vllm/pull/16623.