[Bug]: deploy on V100, mma -> mma layout conversion is only supported on Ampere

Open brosoul opened this issue 1 year ago • 7 comments

Your current environment

There are some related issues: #2729, #6723

The output of `python collect_env.py`
Deploy model on V100
Versions of relevant libraries:
[pip3] flashinfer==0.1.4+cu121torch2.4
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.20
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.5@

🐛 Describe the bug

We use vLLM to serve deepseek-ai/deepseek-coder-33b-instruct on V100 and hit the error below (screenshot of the traceback attached in the original issue).

The current workaround is to set --enable-chunked-prefill=False, but this is not obvious to most users.

Does vLLM plan to reimplement the forward kernel so that chunked prefill can be enabled on V100?
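
For reference, a minimal offline-inference sketch of the workaround via the Python API (assuming the LLM entry point forwards enable_chunked_prefill to the engine arguments the same way the CLI flag does; tensor_parallel_size and dtype are placeholders for a typical multi-V100 setup):

```python
# Sketch of the workaround: disable chunked prefill so the Triton prefill
# kernel that triggers the "mma -> mma" assertion is not used on Volta.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/deepseek-coder-33b-instruct",
    dtype="half",                  # V100 has no bfloat16 support
    tensor_parallel_size=4,        # assumption: adjust to your number of V100s
    enable_chunked_prefill=False,  # the workaround discussed in this issue
)

outputs = llm.generate(
    ["def quicksort(arr):"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```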

Before submitting a new issue...

  • [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

brosoul avatar Aug 30 '24 06:08 brosoul

same issue, is there a workaround for this?

khaerensml6 avatar Sep 05 '24 12:09 khaerensml6

same here

K-Mistele avatar Sep 06 '24 21:09 K-Mistele

INFO 10-26 04:51:50 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 11.2 tokens/s, Running: 1 reqs, Swapped: 1 reqs, Pending: 26 reqs, GPU KV cache usage: 63.6%, CPU KV cache usage: 97.2%.
INFO 10-26 04:51:55 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 11.3 tokens/s, Running: 1 reqs, Swapped: 1 reqs, Pending: 26 reqs, GPU KV cache usage: 63.8%, CPU KV cache usage: 97.2%.
python: /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
python: /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
ERROR 10-26 04:52:09 client.py:244] TimeoutError('No heartbeat received from MQLLMEngine')
ERROR 10-26 04:52:09 client.py:244] NoneType: None
CRITICAL 10-26 04:52:15 launcher.py:99] MQLLMEngine is already dead, terminating server process
CRITICAL 10-26 04:52:15 launcher.py:99] MQLLMEngine is already dead, terminating server process
CRITICAL 10-26 04:52:15 launcher.py:99] MQLLMEngine is already dead, terminating server process

In version 0.6.2, the same issue occurred after the server had been running for several hours. The terminal showed very slow throughput, combined with high CPU cache utilization and low GPU cache utilization. Additionally, setting --enable-chunked-prefill=False did not have any effect. During the run I sent many structured-output requests and a few normal chat requests, never more than four at a time, yet these requests piled up in the pending queue until the OpenAI server crashed.

hpx502766238 avatar Oct 26 '24 03:10 hpx502766238

Version 0.6.4.post1 has the same issue; setting --enable-chunked-prefill=False did not have any effect.

Flynn-Zh avatar Nov 26 '24 09:11 Flynn-Zh

> Version 0.6.4.post1 has the same issue; setting --enable-chunked-prefill=False did not have any effect.

After removing the parameter --enable-prefix-caching, this issue no longer occurs.

Flynn-Zh avatar Nov 26 '24 10:11 Flynn-Zh

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Feb 25 '25 02:02 github-actions[bot]

+1

jiqiujia avatar Mar 06 '25 06:03 jiqiujia

same issue

yinbing668 avatar Mar 13 '25 05:03 yinbing668

Also had this error; fixed it by not enabling automatic prefix caching (i.e., dropping --enable-prefix-caching).

devdev999 avatar Mar 26 '25 06:03 devdev999

Thanks for the solution!

nctu6 avatar Mar 27 '25 10:03 nctu6

@devdev999's solution worked for me. Specifically, I used the --no-enable-prefix-caching flag.
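
For the offline Python API, a minimal sketch of the same fix (assuming enable_prefix_caching maps to the same engine argument as the CLI flag; tensor_parallel_size and dtype are placeholders):

```python
# Sketch: turn off automatic prefix caching, which the comments above
# identify as the trigger for the "mma -> mma layout conversion" assertion
# on V100.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/deepseek-coder-33b-instruct",
    dtype="half",                 # V100 has no bfloat16 support
    tensor_parallel_size=4,       # assumption: adjust to your GPU count
    enable_prefix_caching=False,  # Python-API equivalent of --no-enable-prefix-caching
)
```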

harryli0088 avatar Apr 07 '25 15:04 harryli0088

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Jul 07 '25 02:07 github-actions[bot]

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions[bot] avatar Aug 07 '25 02:08 github-actions[bot]