GPTQ-for-LLaMa
4-bit quantization of LLaMa using GPTQ
I have tried quantizing galactica-30b with this command:
```
CUDA_VISIBLE_DEVICES=0 python opt.py /models/galactica-30b --wbits 4 --save galactica-30b-4bit.pt c4
```
And then using it in the [web UI](https://github.com/oobabooga/text-generation-webui) with this one:...
I believe that we can achieve further optimisation, beyond even 4-bit quantization, by selectively quantizing specifically chosen layers down to 2 bits. See: https://arxiv.org/abs/2203.08368 By selectively quantizing 50% of the...
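A minimal sketch of the mixed-precision idea, assuming a hypothetical per-layer sensitivity score from a calibration pass (`assign_bits` and the score values are illustrative, not part of this repo):
```
# Quantize the least sensitive half of the layers to 2 bits and keep
# the rest at 4 bits. `sensitivities` would come from a calibration
# pass; the values below are made up for illustration.
def assign_bits(sensitivities, low_bits=2, high_bits=4, low_fraction=0.5):
    """Map each layer index to a bit-width based on relative sensitivity."""
    order = sorted(range(len(sensitivities)), key=lambda i: sensitivities[i])
    low_set = set(order[:int(len(sensitivities) * low_fraction)])
    return {i: low_bits if i in low_set else high_bits
            for i in range(len(sensitivities))}

bits = assign_bits([0.9, 0.1, 0.4, 0.05, 0.7, 0.2, 0.8, 0.3])
print(bits)  # half the layers mapped to 2 bits, half to 4
```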
The generation takes more time with each message, as if there's an overhead. For example: the second response is 11x faster than the last response. They have the same number...
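To check whether the slowdown simply tracks the growing chat history, one can time each turn against its prompt length; a minimal sketch assuming a transformers-style model (the model name is a placeholder, not tied to this report):
```
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; substitute the one you are testing.
name = "decapoda-research/llama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")

history = ""
for turn in range(5):
    history += "User: hello\nAssistant:"
    inputs = tokenizer(history, return_tensors="pt").to(model.device)
    start = time.time()
    out = model.generate(**inputs, max_new_tokens=50)
    elapsed = time.time() - start
    history += tokenizer.decode(out[0, inputs["input_ids"].shape[1]:]) + "\n"
    print(f"turn {turn}: {inputs['input_ids'].shape[1]} prompt tokens, {elapsed:.2f}s")
```
If latency grows much faster than the prompt token count, there is overhead beyond the normal cost of re-processing a longer context.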
I tried installing inside an NVIDIA Docker container, and the ninja build includes incorrect sm IDs like `-gencode arch=compute_52,code=sm_52`:
```
# Install kernels
python setup_cuda.py install
```
```
cuda_post_cflags = -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__...
```
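A likely cause is that no GPU was visible when the extension was built (common during `docker build`), so torch fell back to a default architecture list. PyTorch's extension builder honors the `TORCH_CUDA_ARCH_LIST` environment variable; a minimal sketch of pinning it before the install, assuming an sm_86 card:
```
# Pin the architectures torch.utils.cpp_extension emits -gencode flags for.
# Without a visible GPU, torch falls back to a default list that can
# include old targets such as sm_52.
import os
import subprocess

os.environ["TORCH_CUDA_ARCH_LIST"] = "8.6"  # adjust to your GPU's compute capability
subprocess.run(["python", "setup_cuda.py", "install"], check=True)
```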
I get the following error when trying to run setup.py from the GPTQ install. I have an RTX 3090 and followed the instructions from [this GitHub gist](https://gist.github.com/lxe/82eb87db25fdb75b92fa18a6d494ee3c):
```
FAILED: D:/AI/text-generation-webui/repositories/GPTQ-for-LLaMa/build/temp.win-amd64-cpython-310/Release/quant_cuda_kernel.obj
C:\Program Files\NVIDIA...
```
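Before digging into compiler flags, it can help to confirm what torch itself sees, since a mismatch between the installed CUDA toolkit and the one torch was built against is a common cause of `quant_cuda_kernel` build failures. A quick diagnostic using standard torch calls:
```
import torch

print(torch.__version__)                    # torch build
print(torch.version.cuda)                   # CUDA toolkit torch was built with
print(torch.cuda.is_available())            # driver/runtime visible?
print(torch.cuda.get_device_name(0))        # should report the RTX 3090
print(torch.cuda.get_device_capability(0))  # (8, 6) for a 3090
```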
I'm running llama 65b on dual 3090s and at longer contexts I'm noticing seriously long context load times (the time between sending a prompt and tokens actually being received/streamed). It...
Small script to execute WinoGrande tests. See details in the README.
Facebook published expected results for the WinoGrande test, with a score of 70 for the 7B model. I wrote a small script (see #40) that fetches the dataset from...
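A minimal sketch of such an evaluation, assuming the Hugging Face `datasets` copy of WinoGrande and a generic `score(text)` log-likelihood helper (both assumptions here, not the script from #40): each example has a sentence with a `_` blank and two candidate fillers, and the model's pick is the filler whose completed sentence scores higher.
```
from datasets import load_dataset

def evaluate(score):  # score(text) -> log-likelihood, model-specific
    data = load_dataset("winogrande", "winogrande_xl", split="validation")
    correct = 0
    for ex in data:
        s1 = ex["sentence"].replace("_", ex["option1"])
        s2 = ex["sentence"].replace("_", ex["option2"])
        pred = "1" if score(s1) > score(s2) else "2"
        correct += pred == ex["answer"]
    return correct / len(data)
```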
```
CUDA_VISIBLE_DEVICES=0 python llama_inference.py decapoda-research/llama-7b-hf --wbits 4 --load llama7b-4bit.pt --text "this is llama"
Loading model ...
Done.
Traceback (most recent call last):
  File "llama_inference.py", line 115, in <module>
    generated_ids = model.generate(
  File...
```
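The traceback is cut off above, so the root cause isn't visible. As a sanity check, a minimal generation loop against the unquantized checkpoint (a sketch using the standard transformers API, to separate generate-time failures from the 4-bit load path; it assumes enough memory for fp16):
```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "decapoda-research/llama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16, device_map="auto"
)

input_ids = tokenizer("this is llama", return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    generated_ids = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(generated_ids[0]))
```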
Running into an error:
```
running bdist_egg
running egg_info
writing quant_cuda.egg-info\PKG-INFO
writing dependency_links to quant_cuda.egg-info\dependency_links.txt
writing top-level names to quant_cuda.egg-info\top_level.txt
reading manifest file 'quant_cuda.egg-info\SOURCES.txt'
writing manifest file 'quant_cuda.egg-info\SOURCES.txt'
installing library code...
```