
Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.

Open gty111 opened this issue 1 year ago • 21 comments

When executing the script examples/offline_inference_with_prefix.py, it calls context_attention_fwd from vllm.model_executor.layers.triton_kernel.prefix_prefill, which triggers the following error:

python: /project/lib/Analysis/Allocation.cpp:40: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.

Platform:

  • V100
  • CUDA 12.0
  • python 3.11.6
  • vllm 0.3.0+cu120
  • triton 2.1.0
  • torch 2.1.2

related to #1669

gty111 avatar Feb 02 '24 12:02 gty111

Same problem! Have you addressed this issue?

eric8607242 avatar Mar 08 '24 01:03 eric8607242

Not yet.

gty111 avatar Mar 08 '24 03:03 gty111

Same problem! Have you addressed this bug?

ikushare avatar Mar 11 '24 07:03 ikushare

Same issue here. With the latest version of vllm, it says V100 is not supported. Have you found a workaround for this problem?

ucasAliy avatar Mar 14 '24 08:03 ucasAliy

But when setting prefix_pos <= 15, it runs.

ucasAliy avatar Mar 14 '24 08:03 ucasAliy

Same issue here. With the latest version of vllm, it says V100 is not supported. Have you found a workaround for this problem?

Not yet.

But when setting prefix_pos <= 15, it runs.

Since the block size is 16, vLLM won't cache the prefix if prefix_pos <= 15.

gty111 avatar Mar 14 '24 08:03 gty111
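
For reference, a quick way to see why prefix_pos <= 15 sidesteps the crash, assuming prefix caching works in whole blocks of block_size tokens (an inference from the comment above, not verified against the vLLM source):

    # Assumes prefix caching operates on whole blocks of block_size tokens.
    block_size = 16
    for prefix_pos in (8, 15, 16, 31, 32):
        cached_tokens = (prefix_pos // block_size) * block_size
        print(f"prefix_pos={prefix_pos:>2} -> {cached_tokens:>2} prefix tokens cached")
    # prefix_pos <= 15 caches nothing, so the prefix-prefill Triton kernel
    # (and the assertion inside it) is never reached.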

I am using an A10 GPU. Upgrading Triton from 2.1.0 to 2.2.0 solved my problem. After reading this issue (https://github.com/openai/triton/issues/1298), I found that Triton has already removed this assertion in the newest version.

Qinyu-Xu avatar Mar 15 '24 04:03 Qinyu-Xu

I am using an A10 GPU. Upgrading Triton from 2.1.0 to 2.2.0 solved my problem. After reading this issue (https://github.com/openai/triton/issues/1298), I found that Triton has already removed this assertion in the newest version.

It doesn't work for me.

gty111 avatar Mar 15 '24 07:03 gty111

Same issue on V100. Any update on supporting V100?

wenqf11 avatar Apr 01 '24 07:04 wenqf11

Same problem with V100. Is there a way to rely on the paged attention kernel instead of context_attention_fwd, @caoshiyi?

Otherwise, this might be a solution: https://github.com/openai/triton/issues/1420#issuecomment-1485564996

matthieu-zimmer avatar Apr 03 '24 09:04 matthieu-zimmer

As suggested by Jokeren, storing the temporary values to global memory and then reloading them, with the latest Triton version, works on V100.

matthieu-zimmer avatar May 22 '24 12:05 matthieu-zimmer
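
For reference, a minimal, self-contained sketch of that store-and-reload workaround (this is not vLLM's actual _fwd_kernel; the kernel name, buffers, and block sizes are illustrative). The assertion fires when the accumulator of one tl.dot feeds a second tl.dot, which on pre-Ampere GPUs would require an mma -> mma layout conversion; spilling the intermediate to a caller-allocated global scratch buffer and reloading it breaks that chain:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _two_dots_with_spill(a_ptr, b_ptr, c_ptr, out_ptr, scratch_ptr,
                             BLOCK_M: tl.constexpr,
                             BLOCK_K: tl.constexpr,
                             BLOCK_N: tl.constexpr):
        offs_m = tl.arange(0, BLOCK_M)
        offs_k = tl.arange(0, BLOCK_K)
        offs_n = tl.arange(0, BLOCK_N)

        # First matmul: (M, K) x (K, N) -> (M, N); the result lives in an mma layout.
        a = tl.load(a_ptr + offs_m[:, None] * BLOCK_K + offs_k[None, :])
        b = tl.load(b_ptr + offs_k[:, None] * BLOCK_N + offs_n[None, :])
        acc = tl.dot(a, b)

        # Workaround for pre-Ampere GPUs: spill the intermediate to global
        # memory and reload it, so the second tl.dot does not need an
        # mma -> mma layout conversion.
        spill_offs = offs_m[:, None] * BLOCK_N + offs_n[None, :]
        tl.store(scratch_ptr + spill_offs, acc)
        tl.debug_barrier()
        acc = tl.load(scratch_ptr + spill_offs)

        # Second matmul: (M, N) x (N, K) -> (M, K).
        c = tl.load(c_ptr + offs_n[:, None] * BLOCK_K + offs_k[None, :])
        out = tl.dot(acc.to(tl.float16), c)
        tl.store(out_ptr + offs_m[:, None] * BLOCK_K + offs_k[None, :], out)

    def run_demo(block_m=16, block_k=16, block_n=16):
        # Single-program launch, just to show the buffers involved.
        a = torch.randn(block_m, block_k, device="cuda", dtype=torch.float16)
        b = torch.randn(block_k, block_n, device="cuda", dtype=torch.float16)
        c = torch.randn(block_n, block_k, device="cuda", dtype=torch.float16)
        scratch = torch.empty(block_m, block_n, device="cuda", dtype=torch.float32)
        out = torch.empty(block_m, block_k, device="cuda", dtype=torch.float32)
        _two_dots_with_spill[(1,)](a, b, c, out, scratch,
                                   BLOCK_M=block_m, BLOCK_K=block_k, BLOCK_N=block_n)
        return out

In vLLM's real prefix-prefill kernel the intermediate would be an attention tile and the scratch buffer would need to be sized per program instance; the point here is only the store/barrier/reload pattern between the two tl.dot calls.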

Same error on V100.

  • vllm: 0.5.1
  • pytorch: 2.3.0+cu121

naturomics avatar Jul 09 '24 06:07 naturomics

Same error on V100. Ubuntu 22, GPU A100 32G, Python 3.10, CUDA 12.1, vllm 0.5.0.post1, Triton 2.3.0

Python code:

    max_model_len, tp_size = 131072, 1
    model_name = "THUDM/glm-4-9b-chat"
    prompt = [{"role": "user", "content": "你好"}]

    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    llm = LLM(
        model=model_name,
        tensor_parallel_size=tp_size,
        max_model_len=max_model_len,
        trust_remote_code=True,
        enforce_eager=True,
        dtype='half',
        enable_chunked_prefill=True,
        max_num_batched_tokens=8192,
    )
    stop_token_ids = [151329, 151336, 151338]
    sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)

    inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
    outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)

    print(outputs[0].outputs[0].text)

Log:

    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    WARNING 07-09 17:48:14 config.py:1222] Casting torch.bfloat16 to torch.float16.
    INFO 07-09 17:48:14 config.py:707] Chunked prefill is enabled (EXPERIMENTAL).
    INFO 07-09 17:48:14 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='THUDM/glm-4-9b-chat', speculative_config=None, tokenizer='THUDM/glm-4-9b-chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=model/glm/glm-4-9b-chat)
    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    WARNING 07-09 17:48:15 tokenizer.py:126] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
    INFO 07-09 17:48:15 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
    INFO 07-09 17:48:15 selector.py:51] Using XFormers backend.
    INFO 07-09 17:48:18 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
    INFO 07-09 17:48:18 selector.py:51] Using XFormers backend.
    INFO 07-09 17:48:28 model_runner.py:160] Loading model weights took 17.5635 GB
    INFO 07-09 17:48:30 gpu_executor.py:83] # GPU blocks: 13067, # CPU blocks: 6553
    Processed prompts:   0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
    python: /project/lib/Analysis/Allocation.cpp:43: std::pair<llvm::SmallVector, llvm::SmallVector > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.

wukonggeo avatar Jul 10 '24 03:07 wukonggeo

As suggested by Jokeren, storing the temporary values to global memory and then reloading them, with the latest Triton version, works on V100.

Which version will work? vllm==0.4.3 and triton==2.3.0 give the same error.

tricky61 avatar Jul 12 '24 06:07 tricky61

Similar problem with V100.

Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.

python==3.10 triton==2.3.0 vllm==0.5.1

ZG2017 avatar Jul 18 '24 16:07 ZG2017

This problem arises from https://github.com/triton-lang/triton/pull/2627/files. vLLM implements its own fwd kernel in prefix_prefill.py, which triggers this issue. I think we should modify _fwd_kernel in vllm/vllm/attention/ops/prefix_prefill.py.

Yang-x-Zhao avatar Jul 26 '24 03:07 Yang-x-Zhao

Same problem with NVIDIA V100.

vllm-0.5.3 
nvidia-nccl-cu12-2.20.5
outlines-0.0.46 
torch-2.3.1 
triton-2.3.1 
xformers-0.0.27

geekchen007 avatar Aug 02 '24 09:08 geekchen007

Adding the following flag worked for me:

vllm serve ...  --enable-chunked-prefill=False

grgkovac avatar Aug 08 '24 15:08 grgkovac

There is a similar issue: https://github.com/vllm-project/vllm/issues/6723

On V100, --enable-chunked-prefill and --enable-prefix-caching can trigger this crash when serving, so these features might not be available on V100.

Yang-x-Zhao avatar Aug 15 '24 03:08 Yang-x-Zhao
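
For anyone wiring this up programmatically, a hedged sketch of gating those two features on compute capability (not official vLLM guidance; the model name is a placeholder, and the keyword arguments are the same EngineArgs-style flags mentioned elsewhere in this thread):

    import torch
    from vllm import LLM

    # The failing assertion says the mma -> mma layout conversion is only
    # supported on Ampere, so only enable these features on sm_80 or newer.
    major, _minor = torch.cuda.get_device_capability()
    is_ampere_or_newer = major >= 8

    llm = LLM(
        model="facebook/opt-125m",               # placeholder model
        dtype="half",                            # V100 has no bfloat16 support
        enable_prefix_caching=is_ampere_or_newer,
        enable_chunked_prefill=is_ampere_or_newer,
    )

Volta is sm_70, so the check keeps both features off on V100 while leaving them available on Ampere and newer GPUs.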

Same issue here on a V100 Tesla 32GB.

K-Mistele avatar Sep 06 '24 21:09 K-Mistele

Setting enable_prefix_caching=False and enable_chunked_prefill=False worked on V100.

KeiLongW avatar Oct 03 '24 08:10 KeiLongW

Adding the following flag worked for me:

vllm serve ...  --enable-chunked-prefill=False

Works for me, on V100S, with the vllm docker image vllm-openai-v0.6.3.

Ch3nYe avatar Nov 27 '24 02:11 Ch3nYe

It looks like these can be set programmatically when calling the vLLM APIs from Python code (i.e., no server per se) by passing them when creating the engine:

    engine_args = vllm.AsyncEngineArgs(
        model=model_path,
        tokenizer=model_path,
        device="cuda",
        dtype="float16",
        enforce_eager=True,
        enable_chunked_prefill=False,
        enable_prefix_caching=False,
        # tensor_parallel_size=self.config.tensor_parallel_size,
        # enforce_eager=self.config.enforce_eager,
        # gpu_memory_utilization=self.config.gpu_memory_utilization,
        # max_num_seqs=self.config.max_num_seqs,
        # max_model_len=self.config.max_model_len,
    )

wnm3 avatar Dec 19 '24 21:12 wnm3

As suggested by Jokeren, storing the temporary values to global memory and then reloading them, with the latest Triton version, works on V100.

Hi, I've met the same problem on V100. Could you show how to solve it in detail? Do you mean reloading input_tokens?

hjunjie0324 avatar Feb 14 '25 07:02 hjunjie0324

Same error.

yinbing668 avatar Mar 13 '25 05:03 yinbing668

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Jun 12 '25 02:06 github-actions[bot]

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions[bot] avatar Jul 13 '25 02:07 github-actions[bot]