
Won't load quantized models

Open sharp-trickster opened this issue 1 year ago • 8 comments

Describe the bug

I downloaded OPT, Galactica, and even CodeGen (not listed in the download menu; fetched from Hugging Face), and they all work fine. I tried the same with 4-bit quantized models (vicuna-13b-GPTQ-4bit-128g and gpt4-x-alpaca-13b-native-4bit-128g), but those throw a "CUDA extension not installed" error at me, while the other models run just fine.

I have a copy of the start-webui.bat file with `--wbits 4 --groupsize 128` added to the `call python server.py --auto-devices --chat` line.
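
For reference, the modified line would look roughly like this (a sketch assuming the default one-click-installer start-webui.bat; other flags on that line may differ on your setup):

```bat
rem start-webui.bat -- relevant line only (hypothetical example of adding the GPTQ flags)
call python server.py --auto-devices --chat --wbits 4 --groupsize 128
```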

The key line in the error is `CUDA extension not installed.`; the rest of the output is mostly TypedStorage deprecation warnings (the full log is in the Logs section below). After printing it, the console shows "Press any key to continue..." and the window closes.

Is there an existing issue for this?

  • [X] I have searched the existing issues

Reproduction

1. Download a quantized model. I tried anon8231489123/gpt4-x-alpaca-13b-native-4bit-128g and https://huggingface.co/anon8231489123/vicuna-13b-GPTQ-4bit-128g (one way to fetch them is sketched after this list).
2. Add `--wbits 4 --groupsize 128` to the `call python server.py --auto-devices --chat` line in start-webui.bat.
3. Run it and select the quantized model.
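
For step 1, the models can also be pulled from the command line with the repo's download script (assuming the download-model.py script in the text-generation-webui root, run from that directory inside the installer's Python environment):

```bat
rem hypothetical example: fetch one of the quantized models into models\
python download-model.py anon8231489123/vicuna-13b-GPTQ-4bit-128g
```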

Screenshot

No response

Logs

```
Loading vicuna-13b-GPTQ-4bit-128g...
CUDA extension not installed.
Found the following quantized model: models\vicuna-13b-GPTQ-4bit-128g\vicuna-13b-4bit-128g.safetensors
Loading model ...
C:\dev\oobabooga-windows\installer_files\env\lib\site-packages\safetensors\torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  with safe_open(filename, framework="pt", device=device) as f:
C:\dev\oobabooga-windows\installer_files\env\lib\site-packages\torch\_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
C:\dev\oobabooga-windows\installer_files\env\lib\site-packages\torch\storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage = cls(wrap_storage=untyped_storage)
  
  Press any key to continue...
```

System Info

AMD FX8320E
16GB RAM
RTX2060 12GB VRAM

sharp-trickster avatar Apr 10 '23 05:04 sharp-trickster

Did you install with the one-click installer? If not, do the normal GPTQ installation: https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#4-bit-mode
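
Roughly, the manual install boils down to something like this (a sketch from memory of the wiki steps at the time; the exact branch names and paths may have changed):

```bat
rem rough sketch of the manual GPTQ-for-LLaMa (CUDA branch) install
cd text-generation-webui
mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa -b cuda
cd GPTQ-for-LLaMa
python setup_cuda.py install
```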

BarfingLemurs avatar Apr 10 '23 06:04 BarfingLemurs

https://github.com/oobabooga/text-generation-webui/issues/1000#issuecomment-1501884338

jllllll avatar Apr 10 '23 14:04 jllllll

Related https://github.com/oobabooga/text-generation-webui/issues/984

bbecausereasonss avatar Apr 10 '23 14:04 bbecausereasonss

I've read a lot and reinstalled the whole thing, and now I get a less cryptic error (can't find libcudart.so), which is a Linux library name even though I'm on Windows and everything is installed (including CUDA). Upon further reading, I THINK I know what's up: quantized models use GPTQ, whose default branch uses Triton, which is exclusive to Linux; running it on Windows requires WSL2 configuration or something. There is a CUDA branch of GPTQ, so why not default to it? Or at least let the user choose, I guess. To be fair, ever since I deleted everything and reinstalled, it stopped working for non-quantized models too; for now I've given up and am running Galactica directly from Python code that imports the transformers library. Just reporting my findings in case it ends up being useful information for anyone else.
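
For anyone hitting the same "CUDA extension not installed." message, a quick sanity check is whether the compiled kernel can be imported at all (assuming a GPTQ-for-LLaMa build that produces the quant_cuda extension; run this inside the webui's Python environment):

```bat
rem hypothetical sanity check: if this import fails, the CUDA kernel was never built/installed
python -c "import quant_cuda; print('quant_cuda OK')"
```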

sharp-trickster avatar Apr 11 '23 11:04 sharp-trickster

Did you install with the one-click installer? If not, do the normal GPTQ installation: https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#4-bit-mode

I tried it both ways. The one-click installer worked for non-quantized models, but the second time around the Linux CUDA files ended up being required on launch even for the models that don't use GPTQ (as far as I understand it), so it stopped working altogether for the other models too. Thanks for trying to help, though.

sharp-trickster avatar Apr 11 '23 11:04 sharp-trickster

I'm not sure this will actually help with your problem, but I was getting this error with quantized models too. I set my virtual memory to "System managed size" for all my drives, and after that it could finally load them. I hope it helps, because I was really struggling to get quantized models to load before that change, no matter what I did.

Death-777 avatar Apr 12 '23 02:04 Death-777

If you're still having problems, you could try using my one-click installer: https://github.com/oobabooga/one-click-installers/pull/21. If anything still breaks, I'll walk you through it and patch the installer.

xNul avatar Apr 13 '23 17:04 xNul

I've read a lot and reinstalled the whole thing, and now I get a less cryptic error (can't find libcudart.so), which is a Linux library name even though I'm on Windows and everything is installed (including CUDA). Upon further reading, I THINK I know what's up: quantized models use GPTQ, whose default branch uses Triton, which is exclusive to Linux; running it on Windows requires WSL2 configuration or something. There is a CUDA branch of GPTQ, so why not default to it? Or at least let the user choose, I guess. To be fair, ever since I deleted everything and reinstalled, it stopped working for non-quantized models too; for now I've given up and am running Galactica directly from Python code that imports the transformers library. Just reporting my findings in case it ends up being useful information for anyone else.

Even the CUDA branch of GPTQ is having some problems right now; you should use oobabooga's fork. See Step 6 here.
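
In case it helps, swapping in the fork is roughly this (a sketch; it assumes the repositories\GPTQ-for-LLaMa layout from the manual install above, and the fork's default branch may differ):

```bat
rem rough sketch: replace the GPTQ-for-LLaMa checkout with oobabooga's fork and rebuild the kernel
cd text-generation-webui\repositories
rmdir /s /q GPTQ-for-LLaMa
git clone https://github.com/oobabooga/GPTQ-for-LLaMa
cd GPTQ-for-LLaMa
python setup_cuda.py install
```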

ltngonnguyen avatar Apr 14 '23 17:04 ltngonnguyen

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.

github-actions[bot] avatar Oct 07 '23 23:10 github-actions[bot]