GPTQ-for-LLaMa
Fused mlp causes assertion error
After c90adefbf1934f4638ea5c3bba8fc536aad3de57, when fused_mlp is enabled, I get the following error:
python: /opt/conda/conda-bld/torchtriton_1677881345124/work/lib/Analysis/Allocation.cpp:42: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(const mlir::Attribute&, const mlir::Attribute&): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
Aborted (core dumped)
My GPU is a 2080 Ti, which is a Turing card, so I think this is not the same issue as #174.
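A quick way to confirm the architecture from Python (compute capability 7.5 is Turing, 8.x is Ampere):

import torch
# A 2080 Ti reports (7, 5), i.e. sm_75 / Turing; Ampere cards report (8, x).
print(torch.cuda.get_device_capability(0))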
Same problem:
CUDA_VISIBLE_DEVICES=0 python llama_inference.py ./llama-hf/llama-7b --load llama7b-4bit-128g.pt --text "this is llama" --wbits 4 --groupsize 128
Loading model ...
Found 3 unique KN Linear values.
Warming up autotune cache ...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:30<00:00, 2.52s/it]
Found 1 unique fused mlp KN values.
Warming up autotune cache ...
0%| | 0/12 [00:00<?, ?it/s]
python: /project/lib/Analysis/Allocation.cpp:42: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(const mlir::Attribute&, const mlir::Attribute&): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
Aborted (core dumped)
I experience the same problem (identical error message) running 5168950 on a 2080 Ti. Disabling fused_mlp succeeds as a workaround for me.
Hi, how do I disable fused_mlp? My system is CentOS.
At line 279 in llama.py, change fused_mlp=True in the load_quant call to fused_mlp=False.
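For reference, a minimal sketch of that change (the surrounding arguments are assumptions based on the usual load_quant call in llama.py and may differ between revisions; only the fused_mlp keyword is the relevant edit):

# llama.py, around line 279: load the quantized checkpoint with the fused
# MLP kernel disabled. The other arguments shown here are assumptions and
# may not match your revision exactly.
model = load_quant(args.model, args.load, args.wbits, args.groupsize,
                   fused_mlp=False)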
Same problem here. Disabling fused_mlp works for me. Note: use the .pt file, not .safetensors; for some reason the .safetensors checkpoint still triggers the error.