[Bug]: deploy on V100, mma -> mma layout conversion is only supported on Ampere
Your current environment
There are some related issues: #2729, #6723
The output of `python collect_env.py`
Deploying the model on V100
Versions of relevant libraries:
[pip3] flashinfer==0.1.4+cu121torch2.4
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.20
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.5@
🐛 Describe the bug
We use vLLM to serve deepseek-ai/deepseek-coder-33b-instruct on V100 and hit the error "mma -> mma layout conversion is only supported on Ampere".
The current workaround is to set --enable-chunked-prefill=False, but this workaround is unknown to most users.
Does vLLM have plans to reimplement the fwd kernel so that enable-chunked-prefill is supported on V100?
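For reference, here is a minimal sketch of a launch command with this workaround applied. The model name is the one from this report; the entrypoint is the standard OpenAI-compatible server, and --tensor-parallel-size / --dtype are placeholders you would adjust to your own V100 deployment.

```bash
# Sketch of the workaround: disable chunked prefill when serving on V100.
# --tensor-parallel-size is a placeholder; adjust it to your deployment.
# --dtype float16 because V100 has no bfloat16 support.
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/deepseek-coder-33b-instruct \
    --tensor-parallel-size 4 \
    --dtype float16 \
    --enable-chunked-prefill=False
```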
Before submitting a new issue...
- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
same issue, is there a workaround for this?
same here
INFO 10-26 04:51:50 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 11.2 tokens/s, Running: 1 reqs, Swapped: 1 reqs, Pending: 26 reqs, GPU KV cache usage: 63.6%, CPU KV cache usage: 97.2%.
INFO 10-26 04:51:55 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 11.3 tokens/s, Running: 1 reqs, Swapped: 1 reqs, Pending: 26 reqs, GPU KV cache usage: 63.8%, CPU KV cache usage: 97.2%.
python: /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
python: /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
ERROR 10-26 04:52:09 client.py:244] TimeoutError('No heartbeat received from MQLLMEngine')
ERROR 10-26 04:52:09 client.py:244] NoneType: None
CRITICAL 10-26 04:52:15 launcher.py:99] MQLLMEngine is already dead, terminating server process
CRITICAL 10-26 04:52:15 launcher.py:99] MQLLMEngine is already dead, terminating server process
CRITICAL 10-26 04:52:15 launcher.py:99] MQLLMEngine is already dead, terminating server process
In version 0.6.2, the same issue occurred after the server had been running for several hours. The terminal showed very slow throughput, combined with high CPU KV cache utilization and low GPU KV cache utilization. Additionally, setting --enable-chunked-prefill=False did not have any effect. During the run I sent many structured output requests and a few normal chat requests, never more than four at a time, yet these requests ended up in the pending queue until the OpenAI server crashed.
Version 0.6.4.post1 has the same issue; setting --enable-chunked-prefill=False did not have any effect.
After removing the --enable-prefix-caching parameter, this issue no longer occurs.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
+1
same issue
Also had this error; fixed it by not enabling automatic prefix caching (--enable-prefix-caching).
Thanks for the solution!
@devdev999's solution worked for me. Specifically, I used the --no-enable-prefix-caching flag.
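For anyone else landing here, a minimal sketch of a launch command without automatic prefix caching (model name taken from the original report, other flags omitted; --no-enable-prefix-caching is the explicit negative flag mentioned above, and on versions without it you can simply not pass --enable-prefix-caching):

```bash
# Sketch: start the OpenAI-compatible server without automatic prefix caching.
# Alternatively, just omit --enable-prefix-caching from your existing command.
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/deepseek-coder-33b-instruct \
    --no-enable-prefix-caching
```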
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!