GPTQ-for-LLaMa
4-bit quantization of LLaMa using GPTQ
Trying to run the FP16 baseline benchmark for the LLaMA 30B model on a server with 8 V100 32GB GPUs: `CUDA_VISIBLE_DEVICES=0,1 python llama.py /dev/shm/ly/models/hf_converted_llama/30B/ wikitext2 --benchmark 2048 --check` Loading checkpoint shards: 100%|...
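For the FP16 baseline, the usual way to fit a 30B checkpoint across several GPUs is to shard it at load time. A minimal sketch, assuming a Hugging Face-converted checkpoint and Accelerate installed; this is not the repo's llama.py, just a sanity check that the shards load and generate:

```python
# Sketch: load a HF-converted LLaMA checkpoint in FP16, sharded across the
# GPUs made visible via CUDA_VISIBLE_DEVICES. Path taken from the issue above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/dev/shm/ly/models/hf_converted_llama/30B/"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,   # FP16 baseline
    device_map="auto",           # shard the 30B weights across visible GPUs (needs accelerate)
)
model.eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```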
Hey, looking through most of the issues and code I don't see any references to GPT-J; I'm wondering if this supports pyg 6B.
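For context, pyg 6B is (as far as I know) GPT-J-based, and GPTQ quantizes the nn.Linear layers inside each transformer block, so support mostly comes down to teaching the quantizer GPT-J's layer layout. A hedged sketch to inspect that layout without downloading any weights; the config values below are shrunk purely for illustration (the real GPT-J 6B uses n_embd=4096, n_head=16, rotary_dim=64):

```python
# Sketch: list the nn.Linear layers inside one GPT-J block, i.e. the layers a
# GPT-J port of the quantizer would need to handle.
import torch.nn as nn
from transformers import GPTJConfig
from transformers.models.gptj.modeling_gptj import GPTJBlock

config = GPTJConfig(n_embd=64, n_head=4, rotary_dim=16)  # tiny, structure only
block = GPTJBlock(config)
for name, module in block.named_modules():
    if isinstance(module, nn.Linear):
        print(name)
# -> attn.k_proj, attn.v_proj, attn.q_proj, attn.out_proj, mlp.fc_in, mlp.fc_out
```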
3-bit quant of a 65B model: encountered the following error during the pack stage: 
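For reference, here is a rough standalone illustration of what the pack stage does; this is not the repo's QuantLinear.pack, just a sketch of packing sub-byte codes into 32-bit words, which is where 3-bit widths get tricky because codes straddle word boundaries:

```python
# Sketch: bit-pack integer weight codes (0..2**bits - 1) into uint32 words,
# little-endian within each word. A 3-bit code that crosses a word boundary is
# split across two words, which is where shape/offset bugs tend to show up.
import numpy as np

def pack_codes(codes, bits):
    words = [0] * ((len(codes) * bits + 31) // 32)
    for i, c in enumerate(int(x) for x in codes):
        word, off = divmod(i * bits, 32)
        words[word] |= (c << off) & 0xFFFFFFFF
        spill = off + bits - 32
        if spill > 0:                       # code crosses into the next word
            words[word + 1] |= c >> (bits - spill)
    return np.array(words, dtype=np.uint32)

codes = np.random.randint(0, 8, size=64)    # 3-bit codes
packed = pack_codes(codes, bits=3)
print(codes.size, "codes ->", packed.size, "uint32 words")
```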
Hello, I'm attempting to execute the steps in this document: https://huggingface.co/blog/chatbot-amd-gpu But I get stuck at this point: unable to run `python setup_cuda.py install`. It seems like something is...
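For anyone hitting the same wall: the extension build is a standard torch cpp_extension setup. A hedged sketch of what a minimal setup_cuda.py amounts to (file names follow the repo's usual convention and may differ):

```python
# Sketch of a minimal setup script for the quant_cuda extension. On a ROCm
# build of PyTorch, CUDAExtension sources are hipified automatically, so a
# failure here usually means the CUDA/ROCm toolkit is not visible to torch.
from setuptools import setup
from torch.utils import cpp_extension

setup(
    name="quant_cuda",
    ext_modules=[
        cpp_extension.CUDAExtension(
            "quant_cuda",
            ["quant_cuda.cpp", "quant_cuda_kernel.cu"],
        )
    ],
    cmdclass={"build_ext": cpp_extension.BuildExtension},
)
```

A quick check of which toolkit your torch was actually built against: `python -c "import torch; print(torch.version.cuda, torch.version.hip)"`.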
I'm using an old `p40`, which does not seem to support FP16. I tried the latest triton branch and compiled Triton from master; the inference code shows something like ```bash error: invalid...
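A small probe that helps decide the dtype up front: the P40 is Pascal (sm_61), where FP16 math exists but is very slow and some Triton/CUDA kernels are not built for it, so FP32 is the usual fallback. This is a sketch of the check, not a fix for the Triton error itself:

```python
# Sketch: inspect the GPU's compute capability and pick a dtype accordingly.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0), f"sm_{major}{minor}")

# Pre-Volta (< sm_70): prefer FP32 for the dequantized matmuls.
dtype = torch.float32 if (major, minor) < (7, 0) else torch.float16
print("chosen dtype:", dtype)
```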
This may be outside the scope of current development, but the workaround of patching cast.h described in this issue allows the build to succeed for me: https://github.com/pybind/pybind11/issues/4606 I realize we can't...
Please add OpenCL support so that it can be used on GPUs that support OpenCL rather than CUDA.
For smaller models, quantization causes more quality loss than it does for large models. Could the repository try 6-bit / 128 groups for models like LLaMa-7B? This could be most useful for some...
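As a point of reference, a plain round-to-nearest group-wise quantizer at an arbitrary bit width looks like the sketch below. This is not GPTQ itself (which adds error compensation on top), but it shows what "6-bit / groupsize 128" means for a weight matrix:

```python
# Sketch: asymmetric min-max quantization per group of 128 input channels.
import torch

def quantize_groupwise(w: torch.Tensor, bits: int = 6, groupsize: int = 128):
    out_features, in_features = w.shape
    assert in_features % groupsize == 0
    g = w.reshape(out_features, in_features // groupsize, groupsize)
    wmin = g.amin(dim=-1, keepdim=True)
    wmax = g.amax(dim=-1, keepdim=True)
    scale = (wmax - wmin).clamp(min=1e-8) / (2**bits - 1)
    zero = torch.round(-wmin / scale)
    q = torch.clamp(torch.round(g / scale) + zero, 0, 2**bits - 1)
    deq = (q - zero) * scale
    return deq.reshape(out_features, in_features), q

w = torch.randn(256, 512)
w_deq, q = quantize_groupwise(w, bits=6, groupsize=128)
print("mean abs error:", (w - w_deq).abs().mean().item())
```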
I have tried quantizing some models to 8-bit after seeing the scores for Q4. The models produced appear coherent but get WikiText evaluations like 2000 or 5000. In contrast, the...
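A four-digit perplexity from a model that still generates coherent text may point at the evaluation plumbing rather than the weights, so it could be worth re-running a plain WikiText-2 loop against the quantized model. A sketch of the standard stride-equals-seqlen evaluation, assuming `model` and `tokenizer` are already loaded:

```python
# Sketch: WikiText-2 perplexity with non-overlapping seqlen-sized windows.
import torch
from datasets import load_dataset

def wikitext2_ppl(model, tokenizer, seqlen=2048, device="cuda"):
    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    enc = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids
    nsamples = enc.shape[1] // seqlen
    nlls = []
    for i in range(nsamples):
        batch = enc[:, i * seqlen:(i + 1) * seqlen].to(device)
        with torch.no_grad():
            loss = model(batch, labels=batch).loss   # mean NLL over the window
        nlls.append(loss.float() * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (nsamples * seqlen)).item()
```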
Issue: no module named quant_cuda. Branch: fastest-inference-4bit. After what seems to be a proper install, I get the error above when I try `import quant` or `import quant_cuda`. As a...
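A quick way to narrow this down is to check which interpreter you are running and where, if anywhere, the extension landed; building into one Python environment and running from another is a common cause. A small check script:

```python
# Sketch: confirm the runtime interpreter matches the one used for
# `python setup_cuda.py install`, and report where each module resolves.
import importlib
import sys

print("python:", sys.executable)   # must match the env used for the install
for mod in ("torch", "quant_cuda"):
    try:
        m = importlib.import_module(mod)
        print(mod, "->", getattr(m, "__file__", "<builtin>"))
    except ImportError as exc:
        print(mod, "FAILED:", exc)
```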