GPTQ-for-LLaMa
4-bit quantization of LLaMA using GPTQ
Hello, how does one fix this error:
```
$ python3 setup_cuda.py install
running install
/FastChat/.env/lib/python3.11/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/FastChat/.env/lib/python3.11/site-packages/setuptools/command/easy_install.py:144:...
```
Thanks for the great work. Here are the errors from my side (one host with eight V100 GPUs):
```
CUDA_VISIBLE_DEVICES=0 python llama_inference.py /home/xxx/models/hf_converted_llama/7B/ --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is...
```
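For reference, a minimal sketch of the full shape of that invocation, since the report above is truncated. The model path, checkpoint name, and prompt below are placeholders, not values taken from the report; only the flags already shown above are assumed to exist.
```
# Hypothetical paths and prompt; flags are the ones shown in the report above.
CUDA_VISIBLE_DEVICES=0 python llama_inference.py /path/to/hf_converted_llama/7B/ \
    --wbits 4 \
    --groupsize 128 \
    --load llama7b-4bit-128g.pt \
    --text "some prompt text"
```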
https://github.com/SqueezeAILab/SqueezeLLM: will GPTQ-for-LLaMa support this mode?
For the base FP16 model, --eval gives 5.68 PPL on wikitext2, while --benchmark 2048 gives 6.43 on wikitext2. What's the difference?
T5 Benchmark
Thank you for the repo. I am curious what benchmark results (MMLU and BBH) we should expect for the gptq-flan-t5 models. I am getting an average accuracy of 25.2% for...
Hello, I really appreciate the work you have done here. I wonder if you could also release a Python script for finetuning quantized LLaMA on a custom dataset. It is inevitable that...
How does this compare with llama.cpp int4 quantization?
I finetuned BLOOM with LoRA and would like to quantize the model with GPTQ:
```
self.model = AutoModelForCausalLM.from_pretrained(
    self.config['checkpoint_path'],
    device_map='auto',
)
# load adapter
self.model = PeftModelForCausalLM.from_pretrained(self.model, '/tmp/bloom_ori/lora_bloom')
```
Some errors happened...
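One common way to get a checkpoint that a GPTQ quantization script can consume is to merge the LoRA adapter into the base weights first and save a plain Hugging Face model. A minimal sketch using PEFT follows; this is a general PEFT workflow, not something confirmed by this repo, and the base checkpoint path and output directory are placeholders (only the adapter path comes from the report above).
```
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base BLOOM checkpoint (placeholder path).
base = AutoModelForCausalLM.from_pretrained('/path/to/bloom_checkpoint', device_map='auto')

# Attach the LoRA adapter from the report, then fold its weights into the base model.
model = PeftModel.from_pretrained(base, '/tmp/bloom_ori/lora_bloom')
merged = model.merge_and_unload()

# Save a plain (non-PEFT) checkpoint; placeholder output directory.
merged.save_pretrained('/tmp/bloom_lora_merged')
```
The merged directory can then be quantized like any other Hugging Face checkpoint, assuming a GPTQ script for the BLOOM architecture is available.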
After c90adefbf1934f4638ea5c3bba8fc536aad3de57, when `fused_mlp` is enabled, I got the following error:
```
python: /opt/conda/conda-bld/torchtriton_1677881345124/work/lib/Analysis/Allocation.cpp:42: std::pair mlir::triton::getCvtOrder(const mlir::Attribute&, const mlir::Attribute&): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"'...
```
I am not familiar with Triton or CUDA, but it feels like some of this code (fused_attn) could also be used in FP16 to gain an inference speedup compared with Hugging Face?