Harry Mellor
> Same here with deepseek-ai/DeepSeek-R1-Distill-Qwen-32B, running on a single H100. But it also occurred with dual H100 at tp=2, and with Q5/Q6 quantizations Any issues with distilled models are unrelated as...
@Abhishekbhagwat @fan-niu could one of you open an issue detailing the concurrency problem you're seeing?
@Icedcocon can you confirm that when not setting `--gpu-memory-utilization 1`, you are able to serve on 2 x 8 GPU nodes? As others have pointed out, the GPU memory profiling...
Closing as probably solved/stale
Given that the error seems to be coming from the custom ops, please try installing vLLM in a completely fresh Python environment. It's possible that there are old binaries lying...
Interestingly, when microbenchmarking these two ops, `math.prod` appears to be slightly faster for this use case too:

```python
import math
import numpy as np
import timeit

# Test data: list...
```
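A self-contained version of such a microbenchmark might look like the sketch below. The test data and iteration count are illustrative assumptions, not the exact values from the original comment; the two ops compared are `math.prod` and `np.prod`, as suggested by the imports in the truncated snippet.

```python
import math
import timeit

import numpy as np

# Illustrative test data: a short list of integers, e.g. a tensor shape.
shape = [8, 16, 32, 64]

# Sanity check: both ops compute the same product.
assert math.prod(shape) == int(np.prod(shape))

# Time each op over the same number of iterations.
t_math = timeit.timeit(lambda: math.prod(shape), number=100_000)
t_np = timeit.timeit(lambda: np.prod(shape), number=100_000)

print(f"math.prod: {t_math:.4f}s")
print(f"np.prod:   {t_np:.4f}s")
```

For short Python lists like this, `math.prod` typically avoids the overhead of converting the input to a NumPy array, which is one plausible explanation for the gap.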
Thank you for the PR. I have benchmarked this dedicated implementation against the Transformers backend and the performance gap is...
Could you please provide the exact error message and the command you used to trigger it?
I see, thank you for the error and the information! It looks like there might be a typo in how this error message is generated https://github.com/vllm-project/flash-attention/blob/95898bad1d6b2c1668e39bcaa7ce70c38270e194/vllm_flash_attn/flash_attn_interface.py#L58-L64 making it confusing to the...
Unfortunately, it appears that Flash Attention requires a compute capability of 8.0 or above https://github.com/Dao-AILab/flash-attention/blob/a09abcd32d3cae4d83b313446e887f38d02b799f/csrc/flash_attn/flash_api.cpp#L368-L370 Since the V100 only has compute capability 7.0, it does not support Flash Attention. Currently, V1...
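The capability check above can be sketched in plain Python; the helper name and the hard-coded example capabilities below are illustrative assumptions, not part of vLLM or Flash Attention.

```python
# Flash Attention's minimum compute capability is 8.0 (Ampere or newer).
MIN_FA_CAPABILITY = (8, 0)

def supports_flash_attention(capability: tuple) -> bool:
    """Return True if a GPU with the given (major, minor) compute
    capability meets Flash Attention's minimum of 8.0."""
    # Tuple comparison handles major/minor ordering correctly,
    # e.g. (7, 5) < (8, 0) < (9, 0).
    return capability >= MIN_FA_CAPABILITY

# With PyTorch installed, the capability of the current device comes from
# torch.cuda.get_device_capability(); hard-coded examples used here instead:
print(supports_flash_attention((7, 0)))  # V100  -> False
print(supports_flash_attention((9, 0)))  # H100  -> True
```

A V100 reports compute capability (7, 0), so the check fails there, matching the error users see when Flash Attention is requested on that hardware.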