Harry Mellor
> Same here with deepseek-ai/DeepSeek-R1-Distill-Qwen-32B, running on a single H100. But it also occurred with dual H100 at tp=2, and with Q5/Q6 quantizations Any issues with distilled models are unrelated as...
@Abhishekbhagwat @fan-niu could one of you open an issue detailing the concurrency problem you're seeing?
@Icedcocon can you confirm that when not setting `--gpu-memory-utilization 1`, you are able to serve on 2 x 8 GPU nodes? As others have pointed out, the GPU memory profiling...
Closing as probably solved/stale
Given that the error seems to be coming from the custom ops, please try installing vLLM in a completely fresh Python environment. It's possible that there are old binaries lying...
Interestingly, when microbenchmarking these two ops, `math.prod` appears to be slightly faster for this use case too:

```python
import math
import numpy as np
import timeit

# Test data: list...
```
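A self-contained version of such a microbenchmark might look like the sketch below. The test data and iteration count are illustrative assumptions, not the exact values from the original comment; the two ops compared are `math.prod` and `np.prod`, as suggested by the imports in the truncated snippet.

```python
import math
import timeit

import numpy as np

# Illustrative test data: a short list of integers, e.g. a tensor shape.
shape = [8, 16, 32, 64]

# Sanity check: both ops compute the same product.
assert math.prod(shape) == int(np.prod(shape))

# Time each op over the same number of iterations.
t_math = timeit.timeit(lambda: math.prod(shape), number=100_000)
t_np = timeit.timeit(lambda: np.prod(shape), number=100_000)

print(f"math.prod: {t_math:.4f}s")
print(f"np.prod:   {t_np:.4f}s")
```

For short Python lists like this, `math.prod` typically avoids the overhead of converting the input to a NumPy array, which is one plausible explanation for the gap.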
Thank you for the PR. I have benchmarked this dedicated implementation against the Transformers backend and the performance gap is...
Could you please provide the exact error message and the command you used to trigger it?
I see, thank you for the error and the information! It looks like there might be a typo in how this error message is generated https://github.com/vllm-project/flash-attention/blob/95898bad1d6b2c1668e39bcaa7ce70c38270e194/vllm_flash_attn/flash_attn_interface.py#L58-L64 making it confusing to the...
Unfortunately, it appears that Flash Attention requires a compute capability of 8.0 or above https://github.com/Dao-AILab/flash-attention/blob/a09abcd32d3cae4d83b313446e887f38d02b799f/csrc/flash_attn/flash_api.cpp#L368-L370 Since the V100 only has compute capability 7.0, it does not support Flash Attention. Currently, V1...
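The capability check above can be sketched in plain Python; the helper name and the hard-coded example capabilities below are illustrative assumptions, not part of vLLM or Flash Attention.

```python
# Flash Attention's minimum compute capability is 8.0 (Ampere or newer).
MIN_FA_CAPABILITY = (8, 0)

def supports_flash_attention(capability: tuple) -> bool:
    """Return True if a GPU with the given (major, minor) compute
    capability meets Flash Attention's minimum of 8.0."""
    # Tuple comparison handles major/minor ordering correctly,
    # e.g. (7, 5) < (8, 0) < (9, 0).
    return capability >= MIN_FA_CAPABILITY

# With PyTorch installed, the capability of the current device comes from
# torch.cuda.get_device_capability(); hard-coded examples used here instead:
print(supports_flash_attention((7, 0)))  # V100  -> False
print(supports_flash_attention((9, 0)))  # H100  -> True
```

A V100 reports compute capability (7, 0), so the check fails there, matching the error users see when Flash Attention is requested on that hardware.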