GPTQ-for-LLaMa

4-bit quantization of LLaMa using GPTQ

96 GPTQ-for-LLaMa issues, sorted by recently updated

Hello, how does one fix this error?
```
$ python3 setup_cuda.py install
running install
/FastChat/.env/lib/python3.11/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/FastChat/.env/lib/python3.11/site-packages/setuptools/command/easy_install.py:144:...
```

Thanks for the great work. Here are errors from my side (one host with eight V100 GPUs): `CUDA_VISIBLE_DEVICES=0 python llama_inference.py /home/xxx/models/hf_converted_llama/7B/ --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is...`

https://github.com/SqueezeAILab/SqueezeLLM: will GPTQ-for-LLaMa support this mode?

For the base FP16 model, --eval gives 5.68 PPL on wikitext2, while --benchmark 2048 gives 6.43 on wikitext2. What's the difference?
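For context, here is a minimal sketch of how wikitext2 perplexity is commonly computed for a causal LM; the model id, window length, and stride are placeholders for illustration, not this repo's exact --eval or --benchmark code. Differences in tokenization, windowing, and which tokens get scored are typical reasons two "wikitext2 PPL" numbers disagree.

```
# Minimal sketch of wikitext2 perplexity for a causal LM (assumptions:
# placeholder model id, fixed non-overlapping 2048-token windows).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-7b"  # placeholder FP16 base model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
).eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids
seqlen = 2048

nlls = []
for i in range(0, ids.size(1) - seqlen, seqlen):
    batch = ids[:, i : i + seqlen].to(model.device)
    with torch.no_grad():
        # labels == inputs: the model shifts internally and returns mean NLL
        loss = model(batch, labels=batch).loss
    nlls.append(loss.float() * seqlen)

# Perplexity = exp(mean negative log-likelihood per token)
ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen))
print(f"wikitext2 PPL: {ppl.item():.2f}")
```

Whether --eval and --benchmark agree then comes down to whether the two code paths tokenize, window, and score the text the same way.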

Thank you for the repo. I am curious what benchmark results (MMLU and BBH) we should expect for the gptq-flan-t5 models. I am getting an average accuracy of 25.2% for...

Hello, I really appreciate the work done here. I wonder if you could also release a Python script for finetuning quantized LLaMA on a customized dataset. It is inevitable that...

How does this compare with llama.cpp int4 quantization?

I finetuned BLOOM with LoRA and would like to quantize the model with GPTQ:
```
self.model = AutoModelForCausalLM.from_pretrained(
    self.config['checkpoint_path'],
    device_map='auto',
)
# load adapter
self.model = PeftModelForCausalLM.from_pretrained(self.model, '/tmp/bloom_ori/lora_bloom')
```
Some errors happened...
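One common workaround, offered here as an assumption rather than a flow this repo documents, is to merge the LoRA adapter into the FP16 base model with PEFT first and then point GPTQ at the merged checkpoint. The base model id below is a placeholder; the adapter path is the one from the issue.

```
# Sketch (not this repo's documented flow): merge the LoRA adapter into the
# FP16 base model, save the merged weights, then quantize that directory.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-7b1",  # placeholder for self.config['checkpoint_path']
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load the LoRA adapter on top of the base model.
model = PeftModel.from_pretrained(base, "/tmp/bloom_ori/lora_bloom")

# Fold the LoRA deltas into the base weights so GPTQ sees a plain dense model.
merged = model.merge_and_unload()
merged.save_pretrained("/tmp/bloom_merged")  # quantize this directory with GPTQ
```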

After c90adefbf1934f4638ea5c3bba8fc536aad3de57, when `fused_mlp` is enabled, I got the following error:
```
python: /opt/conda/conda-bld/torchtriton_1677881345124/work/lib/Analysis/Allocation.cpp:42: std::pair mlir::triton::getCvtOrder(const mlir::Attribute&, const mlir::Attribute&): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"'...
```

I am not familiar with Triton or CUDA, but it feels like some of the code (fused_attm) could also be used in FP16 to gain an inference speedup compared with Hugging Face?
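For what it's worth, here is a minimal sketch of the general idea, using PyTorch's built-in fused attention as a stand-in (an assumption, not this repo's fused_attm/Triton kernels): a fused kernel can indeed speed up plain FP16 attention relative to an unfused softmax(QK^T/sqrt(d))V written out as separate ops.

```
# Illustration only: PyTorch >= 2.0 ships a fused attention op that benefits
# FP16 inference by avoiding the materialized seq x seq attention matrix.
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) in FP16 on GPU; dims roughly match LLaMA-7B
q = torch.randn(1, 32, 2048, 128, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Dispatches to a fused (FlashAttention-style) kernel when one is available.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```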