
Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.

Open gty111 opened this issue 1 year ago • 21 comments

When executing the script examples/offline_inference_with_prefix.py, it calls context_attention_fwd from vllm.model_executor.layers.triton_kernel.prefix_prefill, which triggers the following error:

python: /project/lib/Analysis/Allocation.cpp:40: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.

Platform:

  • V100
  • CUDA 12.0
  • python 3.11.6
  • vllm 0.3.0+cu120
  • triton 2.1.0
  • torch 2.1.2

related to #1669

gty111 avatar Feb 02 '24 12:02 gty111

Same problem! Have you addressed this issue?

eric8607242 avatar Mar 08 '24 01:03 eric8607242

Not yet.

gty111 avatar Mar 08 '24 03:03 gty111

Same problem! Have you addressed this bug?

ikushare avatar Mar 11 '24 07:03 ikushare

Same issue here. With the latest version of vllm, it says V100 is not supported. Have you found a workaround for this problem?

ucasAliy avatar Mar 14 '24 08:03 ucasAliy

But when setting prefix_pos <= 15, it runs.

ucasAliy avatar Mar 14 '24 08:03 ucasAliy

Same issue here. With the latest version of vllm, it says V100 is not supported. Have you found a workaround for this problem?

Not yet.

But when setting prefix_pos <= 15, it runs.

Since the block size is 16, vLLM won't cache the prefix if prefix_pos <= 15.

gty111 avatar Mar 14 '24 08:03 gty111
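
For reference, a quick way to see why prefix_pos <= 15 sidesteps the crash, assuming prefix caching works in whole blocks of block_size tokens (an inference from the comment above, not verified against the vLLM source):

    # Assumes prefix caching operates on whole blocks of block_size tokens.
    block_size = 16
    for prefix_pos in (8, 15, 16, 31, 32):
        cached_tokens = (prefix_pos // block_size) * block_size
        print(f"prefix_pos={prefix_pos:>2} -> {cached_tokens:>2} prefix tokens cached")
    # prefix_pos <= 15 caches nothing, so the prefix-prefill Triton kernel
    # (and the assertion inside it) is never reached.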

I am using an A10 GPU. Upgrading Triton from 2.1.0 to 2.2.0 solved my problem. After reading this issue (https://github.com/openai/triton/issues/1298), I found that Triton has already removed this assertion in the newest version.

Qinyu-Xu avatar Mar 15 '24 04:03 Qinyu-Xu

I am using an A10 GPU. Upgrading Triton from 2.1.0 to 2.2.0 solved my problem. After reading this issue (https://github.com/openai/triton/issues/1298), I found that Triton has already removed this assertion in the newest version.

It doesn't work for me.

gty111 avatar Mar 15 '24 07:03 gty111

Same issue on V100. Any update on supporting V100?

wenqf11 avatar Apr 01 '24 07:04 wenqf11

Same problem with V100. Is there a way to rely on the paged attention kernel instead of context_attention_fwd, @caoshiyi?

Otherwise, this might be a solution: https://github.com/openai/triton/issues/1420#issuecomment-1485564996

matthieu-zimmer avatar Apr 03 '24 09:04 matthieu-zimmer

As suggested by Jokeren, storing the temporary values to global memory and then reloading them, with the latest Triton version, works on V100.

matthieu-zimmer avatar May 22 '24 12:05 matthieu-zimmer
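
For reference, a minimal, self-contained sketch of that store-and-reload workaround (this is not vLLM's actual _fwd_kernel; the kernel name, buffers, and block sizes are illustrative). The assertion fires when the accumulator of one tl.dot feeds a second tl.dot, which on pre-Ampere GPUs would require an mma -> mma layout conversion; spilling the intermediate to a caller-allocated global scratch buffer and reloading it breaks that chain:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _two_dots_with_spill(a_ptr, b_ptr, c_ptr, out_ptr, scratch_ptr,
                             BLOCK_M: tl.constexpr,
                             BLOCK_K: tl.constexpr,
                             BLOCK_N: tl.constexpr):
        offs_m = tl.arange(0, BLOCK_M)
        offs_k = tl.arange(0, BLOCK_K)
        offs_n = tl.arange(0, BLOCK_N)

        # First matmul: (M, K) x (K, N) -> (M, N); the result lives in an mma layout.
        a = tl.load(a_ptr + offs_m[:, None] * BLOCK_K + offs_k[None, :])
        b = tl.load(b_ptr + offs_k[:, None] * BLOCK_N + offs_n[None, :])
        acc = tl.dot(a, b)

        # Workaround for pre-Ampere GPUs: spill the intermediate to global
        # memory and reload it, so the second tl.dot does not need an
        # mma -> mma layout conversion.
        spill_offs = offs_m[:, None] * BLOCK_N + offs_n[None, :]
        tl.store(scratch_ptr + spill_offs, acc)
        tl.debug_barrier()
        acc = tl.load(scratch_ptr + spill_offs)

        # Second matmul: (M, N) x (N, K) -> (M, K).
        c = tl.load(c_ptr + offs_n[:, None] * BLOCK_K + offs_k[None, :])
        out = tl.dot(acc.to(tl.float16), c)
        tl.store(out_ptr + offs_m[:, None] * BLOCK_K + offs_k[None, :], out)

    def run_demo(block_m=16, block_k=16, block_n=16):
        # Single-program launch, just to show the buffers involved.
        a = torch.randn(block_m, block_k, device="cuda", dtype=torch.float16)
        b = torch.randn(block_k, block_n, device="cuda", dtype=torch.float16)
        c = torch.randn(block_n, block_k, device="cuda", dtype=torch.float16)
        scratch = torch.empty(block_m, block_n, device="cuda", dtype=torch.float32)
        out = torch.empty(block_m, block_k, device="cuda", dtype=torch.float32)
        _two_dots_with_spill[(1,)](a, b, c, out, scratch,
                                   BLOCK_M=block_m, BLOCK_K=block_k, BLOCK_N=block_n)
        return out

In vLLM's real prefix-prefill kernel the intermediate would be an attention tile and the scratch buffer would need to be sized per program instance; the point here is only the store/barrier/reload pattern between the two tl.dot calls.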

Same error on V100.

  • vllm: 0.5.1
  • pytorch: 2.3.0+cu121

naturomics avatar Jul 09 '24 06:07 naturomics

Same error on V100. Ubuntu 22, GPU A100 32G, Python 3.10, CUDA 12.1, vllm 0.5.0.post1, Triton 2.3.0

Python code:

    max_model_len, tp_size = 131072, 1
    model_name = "THUDM/glm-4-9b-chat"
    prompt = [{"role": "user", "content": "你好"}]

    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    llm = LLM(
        model=model_name,
        tensor_parallel_size=tp_size,
        max_model_len=max_model_len,
        trust_remote_code=True,
        enforce_eager=True,
        dtype='half',
        enable_chunked_prefill=True,
        max_num_batched_tokens=8192,
    )
    stop_token_ids = [151329, 151336, 151338]
    sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)

    inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
    outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)

    print(outputs[0].outputs[0].text)

Log:

    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    WARNING 07-09 17:48:14 config.py:1222] Casting torch.bfloat16 to torch.float16.
    INFO 07-09 17:48:14 config.py:707] Chunked prefill is enabled (EXPERIMENTAL).
    INFO 07-09 17:48:14 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='THUDM/glm-4-9b-chat', speculative_config=None, tokenizer='THUDM/glm-4-9b-chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=model/glm/glm-4-9b-chat)
    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    WARNING 07-09 17:48:15 tokenizer.py:126] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
    INFO 07-09 17:48:15 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
    INFO 07-09 17:48:15 selector.py:51] Using XFormers backend.
    INFO 07-09 17:48:18 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
    INFO 07-09 17:48:18 selector.py:51] Using XFormers backend.
    INFO 07-09 17:48:28 model_runner.py:160] Loading model weights took 17.5635 GB
    INFO 07-09 17:48:30 gpu_executor.py:83] # GPU blocks: 13067, # CPU blocks: 6553
    Processed prompts:   0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
    python: /project/lib/Analysis/Allocation.cpp:43: std::pair<llvm::SmallVector, llvm::SmallVector > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.

wukonggeo avatar Jul 10 '24 03:07 wukonggeo

As suggested by Jokeren, storing the temporary values to global memory and then reloading them, with the latest Triton version, works on V100.

Which version will work? vllm==0.4.3 and triton==2.3.0 give the same error.

tricky61 avatar Jul 12 '24 06:07 tricky61

Similar problem with V100.

Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.

python==3.10 triton==2.3.0 vllm==0.5.1

ZG2017 avatar Jul 18 '24 16:07 ZG2017

This problem arises from https://github.com/triton-lang/triton/pull/2627/files. vLLM implements its own fwd kernel in prefix_prefill.py, which triggers this issue. I think we should modify _fwd_kernel in vllm/vllm/attention/ops/prefix_prefill.py.

Yang-x-Zhao avatar Jul 26 '24 03:07 Yang-x-Zhao

Same problem with NVIDIA V100.

vllm-0.5.3 
nvidia-nccl-cu12-2.20.5
outlines-0.0.46 
torch-2.3.1 
triton-2.3.1 
xformers-0.0.27

geekchen007 avatar Aug 02 '24 09:08 geekchen007

Adding the following flag worked for me:

vllm serve ...  --enable-chunked-prefill=False

grgkovac avatar Aug 08 '24 15:08 grgkovac

There is a similar issue: https://github.com/vllm-project/vllm/issues/6723

On V100, --enable-chunked-prefill and --enable-prefix-caching can trigger this crash when serving, so these features might not be available on V100.

Yang-x-Zhao avatar Aug 15 '24 03:08 Yang-x-Zhao
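
For anyone wiring this up programmatically, a hedged sketch of gating those two features on compute capability (not official vLLM guidance; the model name is a placeholder, and the keyword arguments are the same EngineArgs-style flags mentioned elsewhere in this thread):

    import torch
    from vllm import LLM

    # The failing assertion says the mma -> mma layout conversion is only
    # supported on Ampere, so only enable these features on sm_80 or newer.
    major, _minor = torch.cuda.get_device_capability()
    is_ampere_or_newer = major >= 8

    llm = LLM(
        model="facebook/opt-125m",               # placeholder model
        dtype="half",                            # V100 has no bfloat16 support
        enable_prefix_caching=is_ampere_or_newer,
        enable_chunked_prefill=is_ampere_or_newer,
    )

Volta is sm_70, so the check keeps both features off on V100 while leaving them available on Ampere and newer GPUs.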

Same issue here on a V100 Tesla 32GB.

K-Mistele avatar Sep 06 '24 21:09 K-Mistele

Setting enable_prefix_caching=False and enable_chunked_prefill=False worked on V100.

KeiLongW avatar Oct 03 '24 08:10 KeiLongW

Adding the following flag worked for me:

vllm serve ...  --enable-chunked-prefill=False

Works for me, on V100S, with the vllm docker image vllm-openai-v0.6.3.

Ch3nYe avatar Nov 27 '24 02:11 Ch3nYe

It looks like these can be set programmatically when calling the vLLM APIs from Python code (i.e., no server per se) by passing them when creating the engine:

    engine_args = vllm.AsyncEngineArgs(
        model=model_path,
        tokenizer=model_path,
        device="cuda",
        dtype="float16",
        enforce_eager=True,
        enable_chunked_prefill=False,
        enable_prefix_caching=False,
        # tensor_parallel_size=self.config.tensor_parallel_size,
        # enforce_eager=self.config.enforce_eager,
        # gpu_memory_utilization=self.config.gpu_memory_utilization,
        # max_num_seqs=self.config.max_num_seqs,
        # max_model_len=self.config.max_model_len,
    )

wnm3 avatar Dec 19 '24 21:12 wnm3

As suggested by Jokeren, storing the temporary values to global memory and then reloading them, with the latest Triton version, works on V100.

Hi, I've met the same problem on V100. Could you show how to solve it in detail? Do you mean reloading input_tokens?

hjunjie0324 avatar Feb 14 '25 07:02 hjunjie0324

Same error.

yinbing668 avatar Mar 13 '25 05:03 yinbing668

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Jun 12 '25 02:06 github-actions[bot]

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions[bot] avatar Jul 13 '25 02:07 github-actions[bot]