GPTQ-for-LLaMa
4-bit quantization of LLaMA using GPTQ
Hello, how does one fix this error:
```
$ python3 setup_cuda.py install
running install
/FastChat/.env/lib/python3.11/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/FastChat/.env/lib/python3.11/site-packages/setuptools/command/easy_install.py:144:...
```
Thanks for the great work. Here are the errors from my side (one host with eight V100 GPUs):
```
CUDA_VISIBLE_DEVICES=0 python llama_inference.py /home/xxx/models/hf_converted_llama/7B/ --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is...
```
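For reference, a minimal sketch of the full shape of that invocation, since the report above is truncated. The model path, checkpoint name, and prompt below are placeholders, not values taken from the report; only the flags already shown above are assumed to exist.
```
# Hypothetical paths and prompt; flags are the ones shown in the report above.
CUDA_VISIBLE_DEVICES=0 python llama_inference.py /path/to/hf_converted_llama/7B/ \
    --wbits 4 \
    --groupsize 128 \
    --load llama7b-4bit-128g.pt \
    --text "some prompt text"
```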
https://github.com/SqueezeAILab/SqueezeLLM: will GPTQ-for-LLaMa support this mode?
For the base FP16 model, --eval gives 5.68 PPL on wikitext2, while --benchmark 2048 gives 6.43 on wikitext2. What's the difference?
T5 Benchmark
Thank you for the repo. I am curious what benchmark results (MMLU and BBH) we should expect for the gptq-flan-t5 models. I am getting an average accuracy of 25.2% for...
Hello, I really appreciate the work you have done here. I wonder if you could also release a Python script for finetuning quantized LLaMA on a custom dataset. It is inevitable that...
How does this compare with llama.cpp int4 quantization?
I finetuned BLOOM with LoRA and would like to quantize the model with GPTQ:
```
self.model = AutoModelForCausalLM.from_pretrained(
    self.config['checkpoint_path'],
    device_map='auto',
)
# load adapter
self.model = PeftModelForCausalLM.from_pretrained(self.model, '/tmp/bloom_ori/lora_bloom')
```
Some errors happened...
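One common way to get a checkpoint that a GPTQ quantization script can consume is to merge the LoRA adapter into the base weights first and save a plain Hugging Face model. A minimal sketch using PEFT follows; this is a general PEFT workflow, not something confirmed by this repo, and the base checkpoint path and output directory are placeholders (only the adapter path comes from the report above).
```
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base BLOOM checkpoint (placeholder path).
base = AutoModelForCausalLM.from_pretrained('/path/to/bloom_checkpoint', device_map='auto')

# Attach the LoRA adapter from the report, then fold its weights into the base model.
model = PeftModel.from_pretrained(base, '/tmp/bloom_ori/lora_bloom')
merged = model.merge_and_unload()

# Save a plain (non-PEFT) checkpoint; placeholder output directory.
merged.save_pretrained('/tmp/bloom_lora_merged')
```
The merged directory can then be quantized like any other Hugging Face checkpoint, assuming a GPTQ script for the BLOOM architecture is available.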
After c90adefbf1934f4638ea5c3bba8fc536aad3de57, when `fused_mlp` is enabled, I got the following error:
```
python: /opt/conda/conda-bld/torchtriton_1677881345124/work/lib/Analysis/Allocation.cpp:42: std::pair mlir::triton::getCvtOrder(const mlir::Attribute&, const mlir::Attribute&): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"'...
```
I am not familiar with Triton or CUDA, but it feels like some of this code (fused_attn) could also be used in FP16 to gain an inference speedup compared with Hugging Face?