
Can't load llama13b-4bit.pt without using CPU RAM; extremely slow at ~20 seconds/token

Open ineedasername opened this issue 2 years ago • 7 comments

Describe the bug

I have an NVIDIA 3050 with 4GB VRAM. I know, not the beefiest setup by far, but it works very nicely with GPT4All with all their built-in models, perhaps 10 tokens/second on the gpt4all-l13B-snoozy model (running with the built-in defaults).

I'd like to use oobabooga's text-generation-webui since the interface has many more options, so I downloaded the llama13b-4bit.pt model, but attempting to use it with the defaults produced a memory error due to the ~4GB VRAM. Yet the snoozy model (which I think is comparable to llama13b?) runs fine in GPT4All.

I can run llama-13b in oobabooga by offloading to CPU RAM and to disk with --disk (command: python server.py --chat --cpu-memory 8 --model llama-13b --disk), but this results in prompts being answered at roughly 1 token per 20-30 seconds.
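For reference, here is a quick sanity check of how much VRAM PyTorch actually sees on the card (a minimal sketch using standard torch.cuda calls, run inside the installer's Python environment):

```python
import torch

# Report total and currently free VRAM as seen by PyTorch.
# torch.cuda.mem_get_info() returns (free_bytes, total_bytes) for the device.
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info(0)
    print(torch.cuda.get_device_name(0))
    print(f"total: {total / 1024**3:.2f} GiB, free: {free / 1024**3:.2f} GiB")
else:
    print("CUDA is not available in this environment")
```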

Am I doing something wrong (and if so, guidance on the commands I should run would be great!), or are the two models simply not as comparable as I thought they were?

Thanks!

Is there an existing issue for this?

  • [X] I have searched the existing issues

Reproduction

Install the vanilla text-generation-webui one-click installer on Windows with a 4GB VRAM NVIDIA 3050. Download llama13b-4bit.pt per the instructions here: https://gist.github.com/lxe/82eb87db25fdb75b92fa18a6d494ee3c

I also downloaded 41 .bin files of ~1GB each.

I run start_windows.bat and choose llama-13b. It then begins loading the 41 shards.

When I try to load the model in text-generation-webui, I receive a message that 136 MiB could not be allocated because 3.48 GiB are already in use; full log below.

Screenshot

No response

Logs

>start_windows.bat
INFO:Gradio HTTP request redirected to localhost :)
bin C:\Users\Apollo\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117.dll
The following models are available:

1. EleutherAI_pythia-410m-deduped
2. llama-13b

Which one do you want to load? 1-2

2

INFO:Loading llama-13b...
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 41/41 [03:40<00:00,  5.37s/it]
Traceback (most recent call last):
  File "C:\Users\Apollo\Desktop\oobabooga_windows\text-generation-webui\server.py", line 872, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "C:\Users\Apollo\Desktop\oobabooga_windows\text-generation-webui\modules\models.py", line 90, in load_model
    model = model.cuda()
  File "C:\Users\Apollo\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 905, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "C:\Users\Apollo\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  File "C:\Users\Apollo\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  File "C:\Users\Apollo\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "C:\Users\Apollo\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 820, in _apply
    param_applied = fn(param)
  File "C:\Users\Apollo\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 905, in <lambda>
    return self._apply(lambda t: t.cuda(device))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 136.00 MiB (GPU 0; 4.00 GiB total capacity; 3.48 GiB already allocated; 0 bytes free; 3.48 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Done!
Press any key to continue . . .
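(The error message suggests trying max_split_size_mb; for completeness, that allocator option is set through an environment variable before CUDA is initialized, as sketched below, though it is unlikely to help here since the whole model simply doesn't fit in 4GB.)

```python
import os

# Must be set before the first CUDA allocation (i.e. before torch touches the GPU).
# max_split_size_mb only mitigates fragmentation; it cannot create missing VRAM.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```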

System Info

CPU: AMD Ryzen 7 5800H (it has an onboard GPU, but I set the default to the discrete GPU in the OS settings)
System RAM: 16GB
OS: Windows 10 22H2 (64-bit)
GPU: NVIDIA RTX 3050 with 4GB VRAM. I don't know the OEM for it; it's a discrete card in a laptop, so maybe NVIDIA? The settings in Device Manager don't give an obvious answer.

ineedasername avatar May 06 '23 21:05 ineedasername

it works very nicely with GPT4All with all their built-in models, perhaps 10 tokens/second on the gpt4all-l13B-snoozy model.

Can you link to the software that you're using for this? It is very unlikely that you are running a 13B model on 4GB of vRAM locally.

ClayShoaf avatar May 06 '23 22:05 ClayShoaf

@ClayShoaf I used the GPT4All one-click installer found through this link: https://gpt4all.io/installers/gpt4all-installer-win64.exe

Upon startup it allows users to download a list of models, one being the one I mentioned above. It doesn't have the exact same name as the oobabooga llama-13b model, though, so there may be fundamental differences. The 13B snoozy model from GPT4All is about 8GB, if that metric helps clarify anything about the nature of the potential differences. I suppose I might be able to use the snoozy model with oobabooga to get the advanced text-generation-webui capabilities, though I'm not quite sure how I would do that, so it would take quite a bit of tinkering.

I'm a bit embarrassed by all of this: I have a Master's degree in Applied Computational Linguistics from around 2008 and did my thesis on Word Sense Disambiguation, but these LLMs have comprehensively solved that problem, far beyond the immediate horizon that was the state of the art at the time. It makes me feel old and antiquated (my day job benefits from what I learned but doesn't require me to stay up to date in the field). GPT and LLMs are just within my ability to understand at a bare-bones level without seeming like complete magic, because I've used neural nets over the years to create classification models and done a tiny bit of deep learning (my data sets aren't big enough to take full advantage of it). I'm not "old", and yet the state of the art has far surpassed me, to my chagrin and amusement and amazement and joy, but also trepidation.

ineedasername avatar May 07 '23 00:05 ineedasername

Yeah, everything is moving insanely quickly. I am currently out of work, so that has allowed me to try to catch up with everything.

Aside from that, there is no way that you are loading an 8GB model into 4GB of vRAM. One possibility is that it swaps layers with a swapfile or with CPU RAM, but neither of those options is going to give you anywhere close to 10 tokens/second.

What I suspect is probably happening is that some other, lower parameter-size model is actually being loaded.

ClayShoaf avatar May 07 '23 01:05 ClayShoaf

I just looked it up and a 3050 has 8GB of vRAM, not 4. If that's the case, you can probably run a 4bit quantized version of 13b, like this one: https://huggingface.co/Neko-Institute-of-Science/LLaMA-13B-4bit-128g/tree/main

You will probably have to set a smaller context size though. You'll have to play around with it to figure out where you max out.
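To give a feel for why the context size matters, here is a rough KV-cache estimate for LLaMA-13B (assuming 40 layers, hidden size 5120, and an fp16 cache; a sketch that ignores batching and framework overhead):

```python
# Rough KV-cache size for LLaMA-13B as a function of context length.
# Each token stores a K and a V vector of hidden_size values per layer.
n_layers, hidden_size, bytes_per_value = 40, 5120, 2  # fp16
per_token = 2 * n_layers * hidden_size * bytes_per_value
for ctx in (512, 1024, 2048):
    print(f"context {ctx:4d}: ~{per_token * ctx / 1024**3:.2f} GiB of KV cache")
```

That cache comes on top of the quantized weights themselves, which is why shrinking the context buys you headroom.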

ClayShoaf avatar May 07 '23 01:05 ClayShoaf

@ClayShoaf: My RTX 3050 may have less RAM because it's in a laptop. I double-checked using Speccy, though, and it confirmed 4GB VRAM. I also made sure this was the only model available, backing the others up in a sibling folder. I then restarted so it would initialize with this model by default, and it had no problem initializing and accepting prompts. Regarding token generation performance:

You were right. I guess it just seemed so fast because I had been tinkering with other slow models first, and when I got to this one it seemed fast in comparison. So I used a stopwatch and timed it: the actual output rate was about 1.7 seconds per token. So hey, I was only off by a little more than an order of magnitude! Not so bad when I could have been off by 4 or 5. Any insight into

ineedasername avatar May 07 '23 03:05 ineedasername

Honestly, try to use a GGML model that runs on the CPU instead. 4GB is just hopeless for running anything on the GPU. You can technically split the model between VRAM and RAM, where it still runs on the GPU, but with only 4GB that is going to be painfully slow.

Even if you had 8GB, I would say no to running a 13B on GPU only.
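A rough back-of-the-envelope, counting the weights alone:

```python
# Weights-only estimate for a 4-bit quantized 13B model.
# Real usage adds the KV cache and CUDA/runtime overhead on top of this.
params = 13e9          # ~13 billion parameters
bits_per_param = 4
print(f"~{params * bits_per_param / 8 / 1024**3:.1f} GiB for the weights")  # ~6.1 GiB
```

So even quantized to 4 bits, the weights alone are well beyond 4GB.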

So look for models that have GGML in the name; you can load them in textgen and they will always use the CPU.

LaaZa avatar May 07 '23 05:05 LaaZa

@LaaZa Thanks so much for the suggestion. I am, however, an outsider to this sort of modeling, and searching for how to do things turns up a lot of SEO spam like "Free ChatGPT on Your PC!" that regurgitates the obvious or points to the same resources.

Would you be kind enough to explain what GGML is and where some sample models can be found? I know, I know, a DIY approach might teach me more through the effort. But to be honest, outside this hobby project (with some professional possibilities) I'm, well, not in a good place myself right now, and I could use a little extra hand-holding here. I completely understand that I have no inherent right to your time, and I have no problem at all if you have other things to do than hand-hold someone who is just about knowledgeable enough to explore these things given a few days of trial and error.

Thanks for your time already.

ineedasername avatar May 07 '23 16:05 ineedasername

I dug a little further, and it turns out the models, or at least the 13B snoozy one, are already GGML. That would explain getting around the typical RAM requirements. Would you be kind enough to give an ELI5 on GGML, and say whether there is a straightforward way of converting other models to it? Not instructions or anything, just whether it's possible; I can do the rest. Then I'll close out this issue.

EDIT: I didn't have to specify that it should run on the CPU when I loaded it into oobabooga; it just worked. I checked the model settings in the web interface and CPU wasn't selected.

ineedasername avatar May 07 '23 18:05 ineedasername

Related question:

I have loaded 13B snoozy-GGML-4bit on Colab. Available to me are 12GB of CPU RAM and 8GB of VRAM... but when I run inference with model.generate it only uses 4GB of CPU RAM and goes extremely slowly (1 token every few seconds). How do I give it more resources and speed it up? Could I run a non-quantized version?

oranda avatar May 07 '23 18:05 oranda

Look for models here: Hugging Face models

@oranda It doesn't sound like you are using textgen, so this isn't really the issue for that. But either way, if you are using llama-cpp-python, try passing it use_mlock=True; this should ensure it loads the whole model into RAM.
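If you're calling llama-cpp-python directly, it would look roughly like this (a minimal sketch; the model filename is just a placeholder for whatever snoozy GGML file you downloaded):

```python
from llama_cpp import Llama

# Load a GGML model and lock it into RAM so it doesn't get swapped out.
llm = Llama(
    model_path="./models/gpt4all-13b-snoozy.ggml.q4_0.bin",  # placeholder path
    n_ctx=2048,      # context window
    use_mlock=True,  # mlock() the weights into RAM
)

out = llm("Q: What does GGML stand for? A:", max_tokens=64)
print(out["choices"][0]["text"])
```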

LaaZa avatar May 07 '23 19:05 LaaZa

But either way, if you are using llama-cpp-python, try passing it use_mlock=True; this should ensure it loads the whole model into RAM.

Good idea, but use_mlock=True causes my session to crash. Apparently 12 GB is not enough.

oranda avatar May 07 '23 22:05 oranda

Thanks all, I'm going to close out the issue.

ineedasername avatar May 07 '23 23:05 ineedasername