GPTQ-for-LLaMa
4-bit quantization of LLaMA using GPTQ
Could someone help me with **how to quantize my own model with GPTQ-for-LLaMa**? See the screenshot of the output I am getting :cry: **Original full model**: https://huggingface.co/Glavin001/startup-interviews-13b-int4-2epochs-1 **Working quantized model with...
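For reference, quantizing a custom LLaMA-family checkpoint generally follows the same `llama.py` invocation used in the other reports on this page; the sketch below assumes a local HF-format model directory, and the path and output file name are placeholders.

```
# Minimal sketch: quantize a local LLaMA-family finetune to 4-bit, group size 128.
# MODEL_DIR is a placeholder for the HF-format checkpoint directory.
MODEL_DIR=./startup-interviews-13b
CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 \
    --wbits 4 --true-sequential --act-order --groupsize 128 \
    --save startup-interviews-13b-4bit-128g.pt
```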
Got the same error as [issue 142](https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/142#issuecomment-1507778779) - AttributeError: module 'triton.compiler' has no attribute 'OutOfResources' - after applying @geekypathak21's solution (see [PR 1505](https://github.com/openai/triton/pull/1505)) for working around the matmul issue on pre-Volta...
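A first check for this kind of AttributeError is to confirm which Triton build is actually installed, since version skew between the installed Triton and the one the kernels were written against is a common cause; the snippet below only inspects the environment.

```
# Check the installed Triton version and where it came from;
# a missing triton.compiler.OutOfResources usually points at a version mismatch.
python -c "import triton; print(triton.__version__)"
pip show triton
```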
Hello everyone, recently I noticed a lack of 4-bit quantized versions of `Google/flan-ul2` on HF, and so decided to set out to quantize the model on my 4090. I struggled...
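For rough context, a back-of-the-envelope estimate (assuming flan-ul2's roughly 20B parameters at 4 bits per weight, and ignoring group scales, zeros, and activation memory) shows why a 24 GB 4090 is workable but tight for this model:

```
# ~20e9 parameters * 0.5 bytes/parameter, converted to GiB (rough estimate only).
python -c "print(f'{20e9 * 0.5 / 2**30:.1f} GiB')"   # ~9.3 GiB for the weights alone
```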
I have the following problem: `model=Honkware/openchat_8192-GPTQ` `text-generation-launcher --model-id $model --num-shard 1 --quantize gptq --port 8080`
```
Traceback (most recent call last):
  File "/home/abalogh/anaconda3/envs/text-generation-inference/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
    ^^^^^...
```
I tried to test GPTQ's PPL metrics on the OPT model via opt.py. The PPL metrics of the OPT model are normal with fake quantization. However, when...
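One way to narrow this down is to compare the simulated-quantization evaluation against a run that reloads the packed checkpoint so the real kernels are exercised; the flags below are an assumption about `opt.py`'s interface, and the model name and file names are placeholders.

```
# Pass 1: quantize OPT and note the PPL reported with simulated (fake) quantization.
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-1.3b c4 \
    --wbits 4 --groupsize 128 --save opt-1.3b-4bit-128g.pt
# Pass 2: reload the packed checkpoint (assumed to run the real kernels)
# and compare the reported PPL against pass 1.
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-1.3b c4 \
    --wbits 4 --groupsize 128 --load opt-1.3b-4bit-128g.pt
```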
(textgen) quanlian@quanlian-System-Product-Name:~/aigc/text-generation-webui/repositories/GPTQ-for-LLaMa$ python setup_cuda.py install running install /home/quanlian/mambaforge/envs/textgen/lib/python3.10/site-packages/setuptools/_distutils/cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated. !! ******************************************************************************** Please avoid running ``setup.py`` directly. Instead, use pypa/build, pypa/installer or other standards-based tools. See...
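The deprecation warning itself is non-fatal; a quick way to confirm the build actually succeeded is to import the extension the setup script produces (module name assumed to be `quant_cuda`, matching the extension name in setup_cuda.py).

```
# If this import succeeds, the CUDA kernel extension was built and installed.
python -c "import quant_cuda; print('quant_cuda OK')"
```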
I converted the LLaMA weights and quantized them, but I got this error when I ran inference. Could someone help me and let me know how I can fix it? Thanks! Here...
Hi, I ran bloom.py using fp16 to test the perplexity (PPL) of BLOOM on the Wikitext-2, PTB, and C4 datasets. The results are 11.79 / 20.14 / 17.68, which is...
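For anyone comparing numbers, an FP16 baseline run presumably looks like the sketch below; the model path is a placeholder, and it is an assumption that leaving `--wbits` at 16 skips quantization and only evaluates on the three datasets.

```
# FP16 baseline perplexity run with bloom.py; BLOOM_DIR is a placeholder path or HF id.
BLOOM_DIR=bigscience/bloom-7b1
CUDA_VISIBLE_DEVICES=0 python bloom.py ${BLOOM_DIR} c4 --wbits 16
```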
When I run the script `CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --wbits 4 --true-sequential --act-order --groupsize 128 --eval --save llama7b-4bit-128g.pt &>baseline.txt &` I get the same PPL as the README, but when infer...
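For a quick sanity check of the saved checkpoint outside the evaluation path, the repo's inference script can be pointed at the same file; the flags below are assumed to mirror the quantization command above, and the prompt is arbitrary.

```
# Load the packed 4-bit checkpoint and generate from a short prompt.
CUDA_VISIBLE_DEVICES=0 python llama_inference.py ${MODEL_DIR} \
    --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt \
    --text "this is llama"
```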
First: thanks for this implementation. I'm using it to load 7B models on my 8 GiB GPU via Ooba Gooba (which fails to report how much memory it used,...
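When the web UI does not report usage, `nvidia-smi` gives a direct read on how much VRAM the loaded 4-bit model is actually taking:

```
# Query current and total GPU memory; run this while the model is loaded.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```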