GPTQ-for-LLaMA and text-generation-webui version incompatibility

Open Morivy opened this issue 1 year ago • 3 comments

Describe the bug

Hello everyone. Help me understand what's going on. I installed text-generation-webui via the one-click script on Windows, and I run models on the GPU. Some models produce gibberish output when using oobabooga's GPTQ-for-LLaMA fork. If I instead install the original GPTQ-for-LLaMA (cuda) repository into text-generation-webui, token generation speed drops by a factor of four (with no installation errors, though), even though GPU load stays at 100%. At the same time, all models start producing normal output. What can I do about this?

List of used models:

- TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g
- jeremy-costello/vicuna-13b-v1.1-4bit-128g
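
For reference, loading these 4-bit models in text-generation-webui at the time looked roughly like the sketch below (hedged: the exact flags depend on the webui version; --wbits, --groupsize, and --model_type were the GPTQ-era loader flags, and the folder name assumes download-model.bat's usual org_model naming):

    python server.py --model TheBloke_vicuna-13B-1.1-GPTQ-4bit-128g --wbits 4 --groupsize 128 --model_type llama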

Here are some test results for these models.

text-generation-webui with oobabooga's GPTQ-for-LLaMA (cuda) fork, default settings, 100% GPU load throughout. Prompt: "Hello, say something about yourself."

TheBloke's versions.

no-act-order.pt:

Output generated in 7.64 seconds (8.64 tokens/s, 66 tokens, context 41, seed 1229294013)

.safetensors:

(gibberish) Output generated in 21.44 seconds (9.28 tokens/s, 199 tokens, context 41, seed 263917108)

jeremy-costello's version.

vicuna-13b-v1.1-4bit-128g (.pt):

(gibberish) Output generated in 20.82 seconds (9.56 tokens/s, 199 tokens, context 42, seed 1443451655)

text-generation-webui with the original GPTQ-for-LLaMA (cuda) repository, default settings, 100% GPU load throughout. Prompt: "Hello, say something about yourself."

TheBloke's versions.

no-act-order.pt:

Output generated in 15.99 seconds (2.13 tokens/s, 34 tokens, context 41, seed 198665102)

.safetensors:

Output generated in 21.75 seconds (1.98 tokens/s, 43 tokens, context 41, seed 863730907)

jeremy-costello's version.

vicuna-13b-v1.1-4bit-128g (.pt):

Output generated in 58.74 seconds (2.15 tokens/s, 126 tokens, context 41, seed 47935661)

Is there an existing issue for this?

  • [X] I have searched the existing issues

Reproduction

Install text-generation-webui on Windows via the latest one-click install.bat. Download these models via download-model.bat:

- TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g (vicuna-13B-1.1-GPTQ-4bit-128g.no-act-order.pt must be downloaded manually from Hugging Face; it is needed purely for comparison)
- jeremy-costello/vicuna-13b-v1.1-4bit-128g

Then just run them.

After that, delete all dependencies, point the GPTQ-for-LLaMa repository in install.bat at the original repo (a sketch of this swap follows below), and reinstall (or update) everything via install.bat.

Then repeat.
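
A hedged sketch of the repository swap mentioned above (the exact clone line varies by installer version; the two endpoints being compared are oobabooga's fork and qwopqwop200's original cuda branch):

    rem before (oobabooga's fork):
    git clone https://github.com/oobabooga/GPTQ-for-LLaMa
    rem after (original repository, cuda branch):
    git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa -b cuda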

Screenshot

(Six screenshots attached.)

Logs

There were no error logs.

System Info

Windows 10 22H2 build 19045.2788
AMD Ryzen 9 3900X, 32GB RAM
Nvidia Geforce GTX 1080 Ti

Morivy avatar Apr 16 '23 21:04 Morivy

Yeah, don't use act-order together with group size. The speed drop isn't worth it.

Ph0rk0z avatar Apr 16 '23 23:04 Ph0rk0z
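
For context on that advice: with qwopqwop200's GPTQ-for-LLaMa, act-order and group size are fixed at quantization time, not at load time. A hedged sketch of producing a 4-bit, group-size-128 file without act-order (flag names from that repo's llama.py; the model path and output name are placeholders):

    python llama.py /path/to/vicuna-13b-hf c4 --wbits 4 --groupsize 128 --save_safetensors vicuna-13b-4bit-128g.safetensors

Adding --act-order on top of --groupsize improves quantization accuracy, but that combination was known to slow the cuda kernel considerably, which matches the roughly 4x drop reported above.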

The gibberish output and the missing CUDA extension error can be fixed with these instructions (for Windows, NVIDIA): install the newest oobabooga one-click installer, then do this (the same sequence is consolidated after the list):

  1. open cmd_windows.bat
  2. pip uninstall quant-cuda
  3. cd text-generation-webui\repositories
  4. rm -f -d -r GPTQ-for-LLaMa
  5. git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa -b cuda
  6. cd GPTQ-for-LLaMa
  7. python setup_cuda.py install
  8. close the cmd and run the start_windows.bat like normal

madwurmz avatar May 01 '23 05:05 madwurmz
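
Consolidated, the same steps as a single command sequence (a hedged sketch: "rm -f -d -r" in step 4 assumes a Unix-style shell is available inside cmd_windows.bat; on plain Windows cmd, rmdir /s /q is the equivalent used here):

    pip uninstall quant-cuda
    cd text-generation-webui\repositories
    rmdir /s /q GPTQ-for-LLaMa
    git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa -b cuda
    cd GPTQ-for-LLaMa
    python setup_cuda.py install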

When I run this, I get:

$ python setup_cuda.py install
Traceback (most recent call last):
  File "/home/user/oobabooga_linux/text-generation-webui/repositories/GPTQ-for-LLaMa/setup_cuda.py", line 2, in <module>
    from torch.utils import cpp_extension

agonzalezm avatar May 11 '23 06:05 agonzalezm
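
A hedged note on that traceback: setup_cuda.py fails on "from torch.utils import cpp_extension", which usually means PyTorch is missing from, or not visible to, the environment running the build. A quick check before rerunning the install:

    python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

If that fails, open the installer's environment first (cmd_windows.bat on Windows, or the analogous cmd script in the Linux one-click package, which the /home/user/oobabooga_linux path suggests is in use here) so the build runs with the Python that has torch installed.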

This issue has been closed after six weeks of inactivity. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.

github-actions[bot] avatar Aug 31 '23 23:08 github-actions[bot]