[Bug]: deploy on V100, mma -> mma layout conversion is only supported on Ampere
Your current environment
There are some related issues: #2729, #6723
The output of `python collect_env.py`
Deploying the model on V100
Versions of relevant libraries:
[pip3] flashinfer==0.1.4+cu121torch2.4
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.20
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.5@
🐛 Describe the bug
We use vLLM to serve deepseek-ai/deepseek-coder-33b-instruct on V100 and hit the error "mma -> mma layout conversion is only supported on Ampere".
The current workaround is to set --enable-chunked-prefill=False, but this workaround is unknown to most users.
Does vLLM have plans to reimplement the fwd kernel so that enable-chunked-prefill is supported on V100?
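For reference, here is a minimal sketch of a launch command with this workaround applied. The model name is the one from this report; the entrypoint is the standard OpenAI-compatible server, and --tensor-parallel-size / --dtype are placeholders you would adjust to your own V100 deployment.

```bash
# Sketch of the workaround: disable chunked prefill when serving on V100.
# --tensor-parallel-size is a placeholder; adjust it to your deployment.
# --dtype float16 because V100 has no bfloat16 support.
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/deepseek-coder-33b-instruct \
    --tensor-parallel-size 4 \
    --dtype float16 \
    --enable-chunked-prefill=False
```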
Before submitting a new issue...
- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
same issue, is there a workaround for this?
same here
INFO 10-26 04:51:50 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 11.2 tokens/s, Running: 1 reqs, Swapped: 1 reqs, Pending: 26 reqs, GPU KV cache usage: 63.6%, CPU KV cache usage: 97.2%.
INFO 10-26 04:51:55 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 11.3 tokens/s, Running: 1 reqs, Swapped: 1 reqs, Pending: 26 reqs, GPU KV cache usage: 63.8%, CPU KV cache usage: 97.2%.
python: /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
python: /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
ERROR 10-26 04:52:09 client.py:244] TimeoutError('No heartbeat received from MQLLMEngine')
ERROR 10-26 04:52:09 client.py:244] NoneType: None
CRITICAL 10-26 04:52:15 launcher.py:99] MQLLMEngine is already dead, terminating server process
CRITICAL 10-26 04:52:15 launcher.py:99] MQLLMEngine is already dead, terminating server process
CRITICAL 10-26 04:52:15 launcher.py:99] MQLLMEngine is already dead, terminating server process
In version 0.6.2, the same issue occurred after the server had been running for several hours. The terminal showed very slow throughput, combined with high CPU KV cache utilization and low GPU KV cache utilization. Additionally, setting --enable-chunked-prefill=False did not have any effect. During the run I sent many structured output requests and a few normal chat requests, never more than four at a time, yet these requests ended up in the pending queue until the OpenAI server crashed.
Version 0.6.4.post1 has the same issue; setting --enable-chunked-prefill=False did not have any effect.
After removing the --enable-prefix-caching parameter, this issue no longer occurs.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
+1
same issue
Also had this error; fixed it by not enabling automatic prefix caching (--enable-prefix-caching).
Thanks for the solution!
@devdev999's solution worked for me. Specifically, I used the --no-enable-prefix-caching flag.
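For anyone else landing here, a minimal sketch of a launch command without automatic prefix caching (model name taken from the original report, other flags omitted; --no-enable-prefix-caching is the explicit negative flag mentioned above, and on versions without it you can simply not pass --enable-prefix-caching):

```bash
# Sketch: start the OpenAI-compatible server without automatic prefix caching.
# Alternatively, just omit --enable-prefix-caching from your existing command.
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/deepseek-coder-33b-instruct \
    --no-enable-prefix-caching
```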
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!