GPTQ-for-LLaMa
4-bit quantization of LLaMa using GPTQ
From the research paper and the tables in the readme, it looks like group-size 64 is very effective at improving the quality of the models. Most noticeable in the...
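The effect of group size can be illustrated with a plain round-to-nearest sketch (not GPTQ's error-compensating algorithm, and the function below is made up for illustration): each group of weights gets its own scale and zero-point, so a smaller group such as 64 tracks the local weight range more closely and loses less precision.

```python
import torch

def quantize_groupwise(w: torch.Tensor, group_size: int = 64, bits: int = 4):
    """Illustrative round-to-nearest per-group quantization (not GPTQ itself)."""
    w = w.reshape(-1, group_size)
    w_min = w.min(dim=1, keepdim=True).values
    w_max = w.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / (2**bits - 1)   # per-group scale
    zero = torch.round(-w_min / scale)                        # per-group zero-point
    q = torch.clamp(torch.round(w / scale) + zero, 0, 2**bits - 1)
    return (q - zero) * scale                                 # dequantized weights

w = torch.randn(4096 * 4096)
for g in (1024, 128, 64):
    err = (quantize_groupwise(w, g).reshape(-1) - w).pow(2).mean()
    print(f"group size {g}: MSE {err:.2e}")  # error shrinks as the group size shrinks
```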
I was getting this error when running `python setup_cuda.py`:

```
quant_cuda_kernel.cu(149): error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (double *, double)
          detected during instantiation...
```
Amazing work! Thank you so much for sharing this. Despite my attempts, I wasn't able to replicate the quantization functions without CUDA. It would be hugely helpful if users could...
Hi, when running `python setup_cuda.py install` I get the following error:

```
running install
running bdist_egg
running egg_info
writing quant_cuda.egg-info\PKG-INFO
writing dependency_links to quant_cuda.egg-info\dependency_links.txt
writing top-level names to quant_cuda.egg-info\top_level.txt
reading...
```
Presumably more than 130 GB of RAM? How much would using a swap file slow it down? Anything else? It seems like since GPTQ has the best results...
Without this change, building for devices with compute capability < 6.0 fails with:

```
quant_cuda_kernel.cu(149): error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (double *,...
```
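For reference, the usual shape of this fix (adapted from the CUDA C Programming Guide) is a software fallback for double-precision `atomicAdd` on pre-sm_60 devices, guarded by `__CUDA_ARCH__`; whether the patch in this PR does exactly this is an assumption.

```cuda
// Fallback atomicAdd for double on compute capability < 6.0, emulated with
// atomicCAS. On sm_60+ the hardware instruction exists and this is not compiled.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 600
__device__ double atomicAdd(double* address, double val) {
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);  // retry if another thread changed the value meanwhile
    return __longlong_as_double(old);
}
#endif
```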
First, a big thanks for this amazing effort! I was just trying to fine-tune this 4-bit model under the transformers framework. The model could be loaded successfully, and the training...
Trying to get LLaMa 30B 4-bit quantized to run with 12 GB of VRAM, and I'm hitting OOM since the model is a bit more than 16 GB. Is it possible to...
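One common approach, sketched here under assumptions rather than something this repo documents for its 4-bit checkpoints, is to use accelerate to split an already-loaded model between the GPU and CPU RAM, keeping whole decoder layers together. The `model` variable, memory budgets, and layer class name below are placeholders.

```python
from accelerate import infer_auto_device_map, dispatch_model

# `model` is assumed to be the quantized model already built on the CPU
# (e.g. whatever this repo's load_quant() returns); the no_split class name
# depends on the transformers version in use.
device_map = infer_auto_device_map(
    model,
    max_memory={0: "11GiB", "cpu": "48GiB"},        # leave headroom on the 12 GB card
    no_split_module_classes=["LlamaDecoderLayer"],  # keep each transformer block whole
)
model = dispatch_model(model, device_map=device_map)  # move submodules per the map
```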
See: https://github.com/tloen/alpaca-lora/blob/main/generate.py

Tried modifying the code to look like this, but no luck initially.

```python
from peft import PeftModel
from transformers import LLaMATokenizer, LLaMAForCausalLM, GenerationConfig

tokenizer = LLaMATokenizer.from_pretrained("decapoda-research/llama-7b-hf")
model = LLaMAForCausalLM.from_pretrained(...
```
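For comparison, the loading pattern in alpaca-lora's generate.py looks roughly like the sketch below, written against current transformers class names (older forks spell them `LLaMATokenizer`/`LLaMAForCausalLM`). Whether PEFT can wrap this repo's 4-bit quantized layers the same way is exactly the open question in this issue.

```python
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
base_model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)
# Attach the LoRA adapter weights on top of the fp16 base model.
model = PeftModel.from_pretrained(base_model, "tloen/alpaca-lora-7b")
```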
```
python llama_inference.py ./llama-7b-hf --wbits 4 --load ./llama-7b-4bit.pt --text "this is llama"
Loading model ...
Done.
Traceback (most recent call last):
  File "/root/GPTQ-for-LLaMa/llama_inference.py", line 114, in
    tokenizer = AutoTokenizer.from_pretrained(args.model)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py",...
```
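A common cause of AutoTokenizer failures with these converted checkpoints is the tokenizer class name recorded in tokenizer_config.json not matching the installed transformers version (`LLaMATokenizer` vs. `LlamaTokenizer`). A workaround worth trying, stated as an assumption about what the truncated traceback would show, is to load the tokenizer class directly:

```python
# Bypass AutoTokenizer and instantiate the LLaMA tokenizer class directly.
# Depending on the installed transformers version the import is spelled
# LlamaTokenizer (>= 4.28) or LLaMATokenizer (older LLaMA forks).
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("./llama-7b-hf", use_fast=False)
```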