
4-bit quantization of LLaMA using GPTQ

96 GPTQ-for-LLaMa issues (sorted by recently updated)

Is it possible to run GPTQ on a machine that has only CPUs? If not, is there a plan for it?

Has anyone compared the inference speed of the 4-bit quantized model with the original FP16 model? Is it faster than the original FP16 model?
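One rough way to check this yourself is with the repo's `--benchmark` mode, which appears in the commands quoted further down this list. This is only a sketch: it assumes the same flag also runs against the unquantized FP16 checkpoint, and the model path and checkpoint name are placeholders, not values from this issue.

```
# Hypothetical speed comparison using llama.py's benchmark mode.
# FP16 baseline (no --wbits / --load):
CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --benchmark 2048

# 4-bit GPTQ checkpoint:
CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --wbits 4 --groupsize 128 \
    --load llama7b-4bit-128g.pt --benchmark 2048
```

Comparing the per-token timings reported by the two runs would answer the question for a given GPU.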

Mistral 7B is dominating the local LLM scene right now and your software doesn't load it. I need your software to work with it... Can we please make your software...

I followed the tutorial in the README to run the code, but when I run this command: ```CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --benchmark 2048...

https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/e985b700f19e670bad9b949cd83056889dd31448/neox.py#L302 This line needs `import math` at the top of the file.
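Until the import is added upstream, a minimal local workaround (assuming a GNU sed environment; otherwise just add the line by hand in an editor) could be:

```
# Prepend the missing import to the top of neox.py (GNU sed syntax)
sed -i '1i import math' neox.py
```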

1. What changes would I need to make for GPTQ to support LoRA for Llama 2? 2. What's the main difference between GPTQ and bitsandbytes? Is it that GPTQ re-adjusts...

```
CUDA_VISIBLE_DEVICES=0 python llama.py /mnt/g/models/conceptofmind_LLongMA-2-13b c4 --wbits 4 --true-sequential --act-order --groupsize 32 --save_safetensors /mnt/g/models/LLongMA-2-13b-16k-GPTQ/4bit-32g-tsao.safetensors
Found cached dataset json (/home/anon/.cache/huggingface/datasets/allenai___json/allenai--c4-6fbe877195f42de5/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Found cached dataset json (/home/anon/.cache/huggingface/datasets/allenai___json/allenai--c4-efc3d4f4606f44bd/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Token indices sequence length is longer...
```

Why does model quantization print "Killed" at the end? ![Untitled](https://github.com/qwopqwop200/GPTQ-for-LLaMa/assets/119348639/34c11719-cd98-4db9-80b7-e9589fba7296)