GPTQ-for-LLaMa
Fused mlp causes assertion error
After c90adefbf1934f4638ea5c3bba8fc536aad3de57, when fused_mlp is enabled, I get the following error:
python: /opt/conda/conda-bld/torchtriton_1677881345124/work/lib/Analysis/Allocation.cpp:42: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(const mlir::Attribute&, const mlir::Attribute&): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
Aborted (core dumped)
My GPU is a 2080 Ti, which is a Turing card, so I think this is not the same issue as #174.
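A quick way to confirm the architecture from Python (compute capability 7.5 is Turing, 8.x is Ampere):

import torch
# A 2080 Ti reports (7, 5), i.e. sm_75 / Turing; Ampere cards report (8, x).
print(torch.cuda.get_device_capability(0))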
Same problem:
CUDA_VISIBLE_DEVICES=0 python llama_inference.py ./llama-hf/llama-7b --load llama7b-4bit-128g.pt --text "this is llama" --wbits 4 --groupsize 128
Loading model ...
Found 3 unique KN Linear values.
Warming up autotune cache ...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:30<00:00, 2.52s/it]
Found 1 unique fused mlp KN values.
Warming up autotune cache ...
0%| | 0/12 [00:00<?, ?it/s]
python: /project/lib/Analysis/Allocation.cpp:42: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(const mlir::Attribute&, const mlir::Attribute&): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
Aborted (core dumped)
I experience the same problem (identical error message) running 5168950 on a 2080 Ti. Disabling fused_mlp succeeds as a workaround for me.
Hi, how do I disable fused_mlp? My system is CentOS.
At line 279 in llama.py, change fused_mlp=True in the load_quant call to fused_mlp=False.
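For reference, a minimal sketch of that change (the surrounding arguments are assumptions based on the usual load_quant call in llama.py and may differ between revisions; only the fused_mlp keyword is the relevant edit):

# llama.py, around line 279: load the quantized checkpoint with the fused
# MLP kernel disabled. The other arguments shown here are assumptions and
# may not match your revision exactly.
model = load_quant(args.model, args.load, args.wbits, args.groupsize,
                   fused_mlp=False)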
Same problem here. Disabling fused_mlp works for me. Note: use the .pt file, not .safetensors; for some reason the .safetensors checkpoint still triggers the error.