GPTQ-for-LLaMa
4-bit quantization of LLaMa using GPTQ
Trying to run the FP16 baseline benchmark for the LLaMA 30B model on a server with 8 V100 32GB GPUs: `CUDA_VISIBLE_DEVICES=0,1 python llama.py /dev/shm/ly/models/hf_converted_llama/30B/ wikitext2 --benchmark 2048 --check` Loading checkpoint shards: 100%|...
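For the FP16 baseline, the usual way to fit a 30B checkpoint across several GPUs is to shard it at load time. A minimal sketch, assuming a Hugging Face-converted checkpoint and Accelerate installed; this is not the repo's llama.py, just a sanity check that the shards load and generate:

```python
# Sketch: load a HF-converted LLaMA checkpoint in FP16, sharded across the
# GPUs made visible via CUDA_VISIBLE_DEVICES. Path taken from the issue above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/dev/shm/ly/models/hf_converted_llama/30B/"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,   # FP16 baseline
    device_map="auto",           # shard the 30B weights across visible GPUs (needs accelerate)
)
model.eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```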
Hey, looking through most of the issues and code I don't see any references to GPT-J; I'm wondering if this supports pyg 6B.
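For context, pyg 6B is (as far as I know) GPT-J-based, and GPTQ quantizes the nn.Linear layers inside each transformer block, so support mostly comes down to teaching the quantizer GPT-J's layer layout. A hedged sketch to inspect that layout without downloading any weights; the config values below are shrunk purely for illustration (the real GPT-J 6B uses n_embd=4096, n_head=16, rotary_dim=64):

```python
# Sketch: list the nn.Linear layers inside one GPT-J block, i.e. the layers a
# GPT-J port of the quantizer would need to handle.
import torch.nn as nn
from transformers import GPTJConfig
from transformers.models.gptj.modeling_gptj import GPTJBlock

config = GPTJConfig(n_embd=64, n_head=4, rotary_dim=16)  # tiny, structure only
block = GPTJBlock(config)
for name, module in block.named_modules():
    if isinstance(module, nn.Linear):
        print(name)
# -> attn.k_proj, attn.v_proj, attn.q_proj, attn.out_proj, mlp.fc_in, mlp.fc_out
```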
3-bit quant of a 65B model: encountered the following error during the pack stage: 
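For reference, here is a rough standalone illustration of what the pack stage does; this is not the repo's QuantLinear.pack, just a sketch of packing sub-byte codes into 32-bit words, which is where 3-bit widths get tricky because codes straddle word boundaries:

```python
# Sketch: bit-pack integer weight codes (0..2**bits - 1) into uint32 words,
# little-endian within each word. A 3-bit code that crosses a word boundary is
# split across two words, which is where shape/offset bugs tend to show up.
import numpy as np

def pack_codes(codes, bits):
    words = [0] * ((len(codes) * bits + 31) // 32)
    for i, c in enumerate(int(x) for x in codes):
        word, off = divmod(i * bits, 32)
        words[word] |= (c << off) & 0xFFFFFFFF
        spill = off + bits - 32
        if spill > 0:                       # code crosses into the next word
            words[word + 1] |= c >> (bits - spill)
    return np.array(words, dtype=np.uint32)

codes = np.random.randint(0, 8, size=64)    # 3-bit codes
packed = pack_codes(codes, bits=3)
print(codes.size, "codes ->", packed.size, "uint32 words")
```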
Hello, I'm attempting to execute the steps in this document: https://huggingface.co/blog/chatbot-amd-gpu But I get stuck at this point: unable to run `python setup_cuda.py install`. It seems like something is...
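For anyone hitting the same wall: the extension build is a standard torch cpp_extension setup. A hedged sketch of what a minimal setup_cuda.py amounts to (file names follow the repo's usual convention and may differ):

```python
# Sketch of a minimal setup script for the quant_cuda extension. On a ROCm
# build of PyTorch, CUDAExtension sources are hipified automatically, so a
# failure here usually means the CUDA/ROCm toolkit is not visible to torch.
from setuptools import setup
from torch.utils import cpp_extension

setup(
    name="quant_cuda",
    ext_modules=[
        cpp_extension.CUDAExtension(
            "quant_cuda",
            ["quant_cuda.cpp", "quant_cuda_kernel.cu"],
        )
    ],
    cmdclass={"build_ext": cpp_extension.BuildExtension},
)
```

A quick check of which toolkit your torch was actually built against: `python -c "import torch; print(torch.version.cuda, torch.version.hip)"`.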
I'm using an old `p40`, which does not seem to support FP16. I tried the latest triton branch and compiled Triton from master; the inference code shows something like ```bash error: invalid...
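A small probe that helps decide the dtype up front: the P40 is Pascal (sm_61), where FP16 math exists but is very slow and some Triton/CUDA kernels are not built for it, so FP32 is the usual fallback. This is a sketch of the check, not a fix for the Triton error itself:

```python
# Sketch: inspect the GPU's compute capability and pick a dtype accordingly.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0), f"sm_{major}{minor}")

# Pre-Volta (< sm_70): prefer FP32 for the dequantized matmuls.
dtype = torch.float32 if (major, minor) < (7, 0) else torch.float16
print("chosen dtype:", dtype)
```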
This may be outside the scope of current development, but the workaround of patching cast.h described in this issue allows the build to succeed for me: https://github.com/pybind/pybind11/issues/4606 I realize we can't...
Please add OpenCL support so that it can be used on GPUs that support OpenCL rather than CUDA.
For smaller models, quantization causes more quality loss than it does for large models. Could the repository try 6-bit / 128 groups for models like LLaMa-7B? This could be most useful for some...
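As a point of reference, a plain round-to-nearest group-wise quantizer at an arbitrary bit width looks like the sketch below. This is not GPTQ itself (which adds error compensation on top), but it shows what "6-bit / groupsize 128" means for a weight matrix:

```python
# Sketch: asymmetric min-max quantization per group of 128 input channels.
import torch

def quantize_groupwise(w: torch.Tensor, bits: int = 6, groupsize: int = 128):
    out_features, in_features = w.shape
    assert in_features % groupsize == 0
    g = w.reshape(out_features, in_features // groupsize, groupsize)
    wmin = g.amin(dim=-1, keepdim=True)
    wmax = g.amax(dim=-1, keepdim=True)
    scale = (wmax - wmin).clamp(min=1e-8) / (2**bits - 1)
    zero = torch.round(-wmin / scale)
    q = torch.clamp(torch.round(g / scale) + zero, 0, 2**bits - 1)
    deq = (q - zero) * scale
    return deq.reshape(out_features, in_features), q

w = torch.randn(256, 512)
w_deq, q = quantize_groupwise(w, bits=6, groupsize=128)
print("mean abs error:", (w - w_deq).abs().mean().item())
```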
I have tried quantizing some models to 8-bit after seeing the scores for Q4. The models produced appear coherent but get WikiText evaluations like 2000 or 5000. In contrast, the...
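A four-digit perplexity from a model that still generates coherent text may point at the evaluation plumbing rather than the weights, so it could be worth re-running a plain WikiText-2 loop against the quantized model. A sketch of the standard stride-equals-seqlen evaluation, assuming `model` and `tokenizer` are already loaded:

```python
# Sketch: WikiText-2 perplexity with non-overlapping seqlen-sized windows.
import torch
from datasets import load_dataset

def wikitext2_ppl(model, tokenizer, seqlen=2048, device="cuda"):
    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    enc = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids
    nsamples = enc.shape[1] // seqlen
    nlls = []
    for i in range(nsamples):
        batch = enc[:, i * seqlen:(i + 1) * seqlen].to(device)
        with torch.no_grad():
            loss = model(batch, labels=batch).loss   # mean NLL over the window
        nlls.append(loss.float() * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (nsamples * seqlen)).item()
```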
Issue: no module named quant_cuda. Branch: fastest-inference-4bit. After what seems to be a proper install, I get the error above when I try `import quant` or `import quant_cuda`. As a...
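A quick way to narrow this down is to check which interpreter you are running and where, if anywhere, the extension landed; building into one Python environment and running from another is a common cause. A small check script:

```python
# Sketch: confirm the runtime interpreter matches the one used for
# `python setup_cuda.py install`, and report where each module resolves.
import importlib
import sys

print("python:", sys.executable)   # must match the env used for the install
for mod in ("torch", "quant_cuda"):
    try:
        m = importlib.import_module(mod)
        print(mod, "->", getattr(m, "__file__", "<builtin>"))
    except ImportError as exc:
        print(mod, "FAILED:", exc)
```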