GPTQ-for-LLaMa
4-bit quantization of LLaMa using GPTQ
I have tried quantizing galactica-30b with this command:
```
CUDA_VISIBLE_DEVICES=0 python opt.py /models/galactica-30b --wbits 4 --save galactica-30b-4bit.pt c4
```
And then using it in the [web UI](https://github.com/oobabooga/text-generation-webui) with this one:...
I believe that we can achieve further optimisation, beyond even 4-bit quantization, by selectively quantizing specifically chosen layers down to 2 bits. See: https://arxiv.org/abs/2203.08368 By selectively quantizing 50% of the...
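A minimal sketch of the mixed-precision idea, assuming a hypothetical per-layer sensitivity score from a calibration pass (`assign_bits` and the score values are illustrative, not part of this repo):
```
# Quantize the least sensitive half of the layers to 2 bits and keep
# the rest at 4 bits. `sensitivities` would come from a calibration
# pass; the values below are made up for illustration.
def assign_bits(sensitivities, low_bits=2, high_bits=4, low_fraction=0.5):
    """Map each layer index to a bit-width based on relative sensitivity."""
    order = sorted(range(len(sensitivities)), key=lambda i: sensitivities[i])
    low_set = set(order[:int(len(sensitivities) * low_fraction)])
    return {i: low_bits if i in low_set else high_bits
            for i in range(len(sensitivities))}

bits = assign_bits([0.9, 0.1, 0.4, 0.05, 0.7, 0.2, 0.8, 0.3])
print(bits)  # half the layers mapped to 2 bits, half to 4
```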
The generation takes more time with each message, as if there's an overhead. For example: the second response is 11x faster than the last response. They have the same number...
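To check whether the slowdown simply tracks the growing chat history, one can time each turn against its prompt length; a minimal sketch assuming a transformers-style model (the model name is a placeholder, not tied to this report):
```
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; substitute the one you are testing.
name = "decapoda-research/llama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")

history = ""
for turn in range(5):
    history += "User: hello\nAssistant:"
    inputs = tokenizer(history, return_tensors="pt").to(model.device)
    start = time.time()
    out = model.generate(**inputs, max_new_tokens=50)
    elapsed = time.time() - start
    history += tokenizer.decode(out[0, inputs["input_ids"].shape[1]:]) + "\n"
    print(f"turn {turn}: {inputs['input_ids'].shape[1]} prompt tokens, {elapsed:.2f}s")
```
If latency grows much faster than the prompt token count, there is overhead beyond the normal cost of re-processing a longer context.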
I tried installing inside an NVIDIA Docker container, and the ninja build includes incorrect sm IDs like `-gencode arch=compute_52,code=sm_52`:
```
# Install kernels
python setup_cuda.py install
```
```
cuda_post_cflags = -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__...
```
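A likely cause is that no GPU was visible when the extension was built (common during `docker build`), so torch fell back to a default architecture list. PyTorch's extension builder honors the `TORCH_CUDA_ARCH_LIST` environment variable; a minimal sketch of pinning it before the install, assuming an sm_86 card:
```
# Pin the architectures torch.utils.cpp_extension emits -gencode flags for.
# Without a visible GPU, torch falls back to a default list that can
# include old targets such as sm_52.
import os
import subprocess

os.environ["TORCH_CUDA_ARCH_LIST"] = "8.6"  # adjust to your GPU's compute capability
subprocess.run(["python", "setup_cuda.py", "install"], check=True)
```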
I get the following error when trying to run setup.py from the GPTQ install. I have an RTX 3090 and followed the instructions from [this GitHub gist](https://gist.github.com/lxe/82eb87db25fdb75b92fa18a6d494ee3c):
```
FAILED: D:/AI/text-generation-webui/repositories/GPTQ-for-LLaMa/build/temp.win-amd64-cpython-310/Release/quant_cuda_kernel.obj
C:\Program Files\NVIDIA...
```
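Before digging into compiler flags, it can help to confirm what torch itself sees, since a mismatch between the installed CUDA toolkit and the one torch was built against is a common cause of `quant_cuda_kernel` build failures. A quick diagnostic using standard torch calls:
```
import torch

print(torch.__version__)                    # torch build
print(torch.version.cuda)                   # CUDA toolkit torch was built with
print(torch.cuda.is_available())            # driver/runtime visible?
print(torch.cuda.get_device_name(0))        # should report the RTX 3090
print(torch.cuda.get_device_capability(0))  # (8, 6) for a 3090
```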
I'm running llama 65b on dual 3090s and at longer contexts I'm noticing seriously long context load times (the time between sending a prompt and tokens actually being received/streamed). It...
Small script to execute WinoGrande tests. See details in the README.
Facebook published expected results for the WinoGrande test, with a score of 70 for the 7B model. I wrote a small script (see #40) that fetches the dataset from...
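A minimal sketch of such an evaluation, assuming the Hugging Face `datasets` copy of WinoGrande and a generic `score(text)` log-likelihood helper (both assumptions here, not the script from #40): each example has a sentence with a `_` blank and two candidate fillers, and the model's pick is the filler whose completed sentence scores higher.
```
from datasets import load_dataset

def evaluate(score):  # score(text) -> log-likelihood, model-specific
    data = load_dataset("winogrande", "winogrande_xl", split="validation")
    correct = 0
    for ex in data:
        s1 = ex["sentence"].replace("_", ex["option1"])
        s2 = ex["sentence"].replace("_", ex["option2"])
        pred = "1" if score(s1) > score(s2) else "2"
        correct += pred == ex["answer"]
    return correct / len(data)
```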
```
CUDA_VISIBLE_DEVICES=0 python llama_inference.py decapoda-research/llama-7b-hf --wbits 4 --load llama7b-4bit.pt --text "this is llama"
Loading model ...
Done.
Traceback (most recent call last):
  File "llama_inference.py", line 115, in <module>
    generated_ids = model.generate(
  File...
```
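The traceback is cut off above, so the root cause isn't visible. As a sanity check, a minimal generation loop against the unquantized checkpoint (a sketch using the standard transformers API, to separate generate-time failures from the 4-bit load path; it assumes enough memory for fp16):
```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "decapoda-research/llama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16, device_map="auto"
)

input_ids = tokenizer("this is llama", return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    generated_ids = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(generated_ids[0]))
```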
Running into an error:
```
running bdist_egg
running egg_info
writing quant_cuda.egg-info\PKG-INFO
writing dependency_links to quant_cuda.egg-info\dependency_links.txt
writing top-level names to quant_cuda.egg-info\top_level.txt
reading manifest file 'quant_cuda.egg-info\SOURCES.txt'
writing manifest file 'quant_cuda.egg-info\SOURCES.txt'
installing library code...
```