text-generation-webui
GPTQ-for-LLaMA and text-generation-webui version incompatibility
Describe the bug
Hello everyone. Help me understand what's going on. I installed text-generation-webui via the one-click script on Windows, and the models run on the GPU. Some models produce gibberish output when using oobabooga's GPTQ-for-LLaMa fork. When I instead install the original GPTQ-for-LLaMa (cuda) repository into t-g-webui, token generation speed drops by a factor of 4 (with no installation errors, though), while GPU load does not change and stays at 100%. At the same time, all models start producing normal output. What can I do about this?
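For reference, this is roughly how such a model gets loaded (a sketch only; the exact flags depend on the webui version, and the model name is just one of those listed below):
- python server.py --model vicuna-13B-1.1-GPTQ-4bit-128g --wbits 4 --groupsize 128 --model_type llama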
List of used models:
TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g
jeremy-costello/vicuna-13b-v1.1-4bit-128g
Here are some tests of these models.
t-g-webui (oobabooga's GPTQ-for-LLaMa (cuda) fork), default settings and prompt, 100% GPU load throughout. Prompt: "Hello, say something about yourself."
TheBloke's versions.
no-act-order.pt:
Output generated in 7.64 seconds (8.64 tokens/s, 66 tokens, context 41, seed 1229294013)
.safetensors:
(gibberish) Output generated in 21.44 seconds (9.28 tokens/s, 199 tokens, context 41, seed 263917108)
jeremy-costello version.
vicuna-13b-v1.1-4bit-128g (.pt):
(gibberish) Output generated in 20.82 seconds (9.56 tokens/s, 199 tokens, context 42, seed 1443451655)
t-g-webui (original GPTQ-for-LLaMa (cuda) repository), default settings and prompt, 100% GPU load throughout. Prompt: "Hello, say something about yourself."
TheBloke's versions.
no-act-order.pt:
Output generated in 15.99 seconds (2.13 tokens/s, 34 tokens, context 41, seed 198665102)
.safetensors:
Output generated in 21.75 seconds (1.98 tokens/s, 43 tokens, context 41, seed 863730907)
jeremy-costello version.
vicuna-13b-v1.1-4bit-128g (.pt):
Output generated in 58.74 seconds (2.15 tokens/s, 126 tokens, context 41, seed 47935661)
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
Install text-generation-webui "in 1 click" on Windows via the latest downloaded install.bat. Download these models via download-model.bat:
TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g (vicuna-13B-1.1-GPTQ-4bit-128g.no-act-order.pt has to be downloaded manually from Hugging Face; it is needed purely for comparison)
jeremy-costello/vicuna-13b-v1.1-4bit-128g
Then just run them.
After that, delete all dependencies, change the GPTQ-for-LLaMa repository to the original one in install.bat, and reinstall (or update) everything via install.bat (a sketch of this edit follows these steps).
Then repeat.
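The repository swap in the step above comes down to editing the git clone line for GPTQ-for-LLaMa inside install.bat. A rough sketch, assuming the installer clones oobabooga's fork by default (the exact line and branch differ between installer versions):
before: git clone https://github.com/oobabooga/GPTQ-for-LLaMa -b cuda
after: git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa -b cuda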
Logs
There were no error logs.
System Info
Windows 10 22H2 build 19045.2788
AMD Ryzen 9 3900X, 32GB RAM
Nvidia Geforce GTX 1080 Ti
Yeah, don't use act-order together with group size; the speed drop isn't worth it. The models that come out as gibberish on oobabooga's fork appear to be the ones quantized with both options, and the kernel that supports that combination is the one running roughly 4x slower.
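For context, act-order is a quantization-time option in qwopqwop200's GPTQ-for-LLaMa, chosen when the .pt/.safetensors file is produced, not when it is loaded. A sketch of quantizing without it (the model path, calibration dataset, and output name here are illustrative):
- python llama.py ./vicuna-13b c4 --wbits 4 --groupsize 128 --save vicuna-13b-4bit-128g.pt
Adding --act-order to that command yields the variant that needs the slower kernel.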
The gibberish output and the missing-CUDA error can be fixed with these instructions (for Windows, NVIDIA): install the newest oobabooga one-click installer, then do this:
- open cmd_windows.bat (this drops you into the installer's conda environment)
- pip uninstall quant-cuda (removes the extension built from the old fork)
- cd text-generation-webui\repositories
- rm -f -d -r GPTQ-for-LLaMa (POSIX-style shell; on plain cmd use rmdir /s /q GPTQ-for-LLaMa instead)
- git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa -b cuda
- cd GPTQ-for-LLaMa
- python setup_cuda.py install (builds and installs the quant_cuda extension)
- close the cmd window and run start_windows.bat like normal
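A quick way to verify the build worked before relaunching (quant_cuda is the extension name that setup_cuda.py registers; run this from the same cmd_windows.bat shell so the right environment is active):
- python -c "import quant_cuda; print('quant_cuda OK')"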
When I run this I get:
$ python setup_cuda.py install
Traceback (most recent call last):
File "/home/user/oobabooga_linux/text-generation-webui/repositories/GPTQ-for-LLaMa/setup_cuda.py", line 2, in
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.