GPTQ-for-LLaMa
4-bit quantization of LLaMa using GPTQ
From the research paper and the tables in the readme, it looks like group-size 64 is very effective at improving the quality of the models. Most noticeable in the...
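The effect of group size can be illustrated with a plain round-to-nearest sketch (not GPTQ's error-compensating algorithm, and the function below is made up for illustration): each group of weights gets its own scale and zero-point, so a smaller group such as 64 tracks the local weight range more closely and loses less precision.

```python
import torch

def quantize_groupwise(w: torch.Tensor, group_size: int = 64, bits: int = 4):
    """Illustrative round-to-nearest per-group quantization (not GPTQ itself)."""
    w = w.reshape(-1, group_size)
    w_min = w.min(dim=1, keepdim=True).values
    w_max = w.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / (2**bits - 1)   # per-group scale
    zero = torch.round(-w_min / scale)                        # per-group zero-point
    q = torch.clamp(torch.round(w / scale) + zero, 0, 2**bits - 1)
    return (q - zero) * scale                                 # dequantized weights

w = torch.randn(4096 * 4096)
for g in (1024, 128, 64):
    err = (quantize_groupwise(w, g).reshape(-1) - w).pow(2).mean()
    print(f"group size {g}: MSE {err:.2e}")  # error shrinks as the group size shrinks
```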
I was getting this error when running `python setup_cuda.py`:

```
quant_cuda_kernel.cu(149): error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (double *, double)
          detected during instantiation...
```
Amazing work! Thank you so much for sharing this. Despite my attempts, I wasn't able to replicate the quantization functions without CUDA. It would be hugely helpful if users could...
Hi, when running `python setup_cuda.py install` I get the following error:

```
running install
running bdist_egg
running egg_info
writing quant_cuda.egg-info\PKG-INFO
writing dependency_links to quant_cuda.egg-info\dependency_links.txt
writing top-level names to quant_cuda.egg-info\top_level.txt
reading...
```
Presumably more than 130 GB of RAM? How much would using a swap file slow it down? Anything else? It seems like since GPTQ has the best results...
Without this change, building for devices with compute capability < 6.0 fails with:

```
quant_cuda_kernel.cu(149): error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (double *,...
```
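For reference, the usual shape of this fix (adapted from the CUDA C Programming Guide) is a software fallback for double-precision `atomicAdd` on pre-sm_60 devices, guarded by `__CUDA_ARCH__`; whether the patch in this PR does exactly this is an assumption.

```cuda
// Fallback atomicAdd for double on compute capability < 6.0, emulated with
// atomicCAS. On sm_60+ the hardware instruction exists and this is not compiled.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 600
__device__ double atomicAdd(double* address, double val) {
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);  // retry if another thread changed the value meanwhile
    return __longlong_as_double(old);
}
#endif
```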
First, a big thanks for this amazing effort! I was just trying to fine-tune this 4-bit model under the transformers framework. The model could be loaded successfully, and the training...
Trying to get LLaMa 30B 4-bit quantized to run with 12 GB of VRAM, and I'm hitting OOM since the model is a bit more than 16 GB. Is it possible to...
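One common approach, sketched here under assumptions rather than something this repo documents for its 4-bit checkpoints, is to use accelerate to split an already-loaded model between the GPU and CPU RAM, keeping whole decoder layers together. The `model` variable, memory budgets, and layer class name below are placeholders.

```python
from accelerate import infer_auto_device_map, dispatch_model

# `model` is assumed to be the quantized model already built on the CPU
# (e.g. whatever this repo's load_quant() returns); the no_split class name
# depends on the transformers version in use.
device_map = infer_auto_device_map(
    model,
    max_memory={0: "11GiB", "cpu": "48GiB"},        # leave headroom on the 12 GB card
    no_split_module_classes=["LlamaDecoderLayer"],  # keep each transformer block whole
)
model = dispatch_model(model, device_map=device_map)  # move submodules per the map
```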
See: https://github.com/tloen/alpaca-lora/blob/main/generate.py

Tried modifying the code to look like this, but no luck initially.

```python
from peft import PeftModel
from transformers import LLaMATokenizer, LLaMAForCausalLM, GenerationConfig

tokenizer = LLaMATokenizer.from_pretrained("decapoda-research/llama-7b-hf")
model = LLaMAForCausalLM.from_pretrained(...
```
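For comparison, the loading pattern in alpaca-lora's generate.py looks roughly like the sketch below, written against current transformers class names (older forks spell them `LLaMATokenizer`/`LLaMAForCausalLM`). Whether PEFT can wrap this repo's 4-bit quantized layers the same way is exactly the open question in this issue.

```python
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
base_model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)
# Attach the LoRA adapter weights on top of the fp16 base model.
model = PeftModel.from_pretrained(base_model, "tloen/alpaca-lora-7b")
```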
```
python llama_inference.py ./llama-7b-hf --wbits 4 --load ./llama-7b-4bit.pt --text "this is llama"
Loading model ...
Done.
Traceback (most recent call last):
  File "/root/GPTQ-for-LLaMa/llama_inference.py", line 114, in
    tokenizer = AutoTokenizer.from_pretrained(args.model)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py",...
```
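A common cause of AutoTokenizer failures with these converted checkpoints is the tokenizer class name recorded in tokenizer_config.json not matching the installed transformers version (`LLaMATokenizer` vs. `LlamaTokenizer`). A workaround worth trying, stated as an assumption about what the truncated traceback would show, is to load the tokenizer class directly:

```python
# Bypass AutoTokenizer and instantiate the LLaMA tokenizer class directly.
# Depending on the installed transformers version the import is spelled
# LlamaTokenizer (>= 4.28) or LLaMATokenizer (older LLaMA forks).
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("./llama-7b-hf", use_fast=False)
```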