text-generation-webui
Error while trying to start it
Describe the bug
When I run the start .bat file, I get this error:
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
I don't know how to reproduce it.
Screenshot
No response
Logs
Gradio HTTP request redirected to localhost :)
bin C:\Users\areleh\Downloads\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117.dll
Loading gpt4-x-alpaca-13b-native-4bit-128g...
Found the following quantized model: models\gpt4-x-alpaca-13b-native-4bit-128g\gpt-x-alpaca-13b-native-4bit-128g-cuda.pt
Loading model ...
Done.
Traceback (most recent call last):
File "C:\Users\areleh\Downloads\oobabooga_windows\oobabooga_windows\text-generation-webui\server.py", line 914, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "C:\Users\areleh\Downloads\oobabooga_windows\oobabooga_windows\text-generation-webui\modules\models.py", line 158, in load_model
model = load_quantized(model_name)
File "C:\Users\areleh\Downloads\oobabooga_windows\oobabooga_windows\text-generation-webui\modules\GPTQ_loader.py", line 197, in load_quantized
model = model.to(torch.device('cuda:0'))
File "C:\Users\areleh\Downloads\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\transformers\modeling_utils.py", line 1896, in to
return super().to(*args, **kwargs)
File "C:\Users\areleh\Downloads\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1145, in to
return self._apply(convert)
File "C:\Users\areleh\Downloads\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
module._apply(fn)
File "C:\Users\areleh\Downloads\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
module._apply(fn)
File "C:\Users\areleh\Downloads\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
module._apply(fn)
[Previous line repeated 2 more times]
File "C:\Users\areleh\Downloads\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 844, in _apply
self._buffers[key] = fn(buf)
File "C:\Users\areleh\Downloads\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1143, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 4.00 GiB total capacity; 3.38 GiB already allocated; 0 bytes free; 3.49 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Done!
Press any key to continue . . .
System Info
Device name LT123
Processor 11th Gen Intel(R) Core(TM) i7-11370H @ 3.30GHz 3.30 GHz
Installed RAM 24.0 GB (23.7 GB usable)
Device ID 60B4CA00-2E64-4BDD-BA14-0A50677D5DFC
Product ID 00325-97208-78436-AAOEM
System type 64-bit operating system, x64-based processor
If you only have 4 GB of VRAM, you are never going to be able to load a 13B model onto the GPU. Look into trying GGML models instead.
How do I add VRAM?
You don't; VRAM is fixed by your GPU. But you can try GGML models, since they run on the CPU and use system RAM. It's not going to be fast, though.
I have this same issue, but I have a 4090 and 64 GB of RAM. I've tried setting the PyTorch max_split_size_mb to 512 in my OS environment variables (`set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512` on Windows). I also set a paging file on the same disk as the oobabooga install. No change to the error.
PS H:\llms\oobabooga> .\start-webui.bat
Starting the web UI...
Gradio HTTP request redirected to localhost :)
Loading settings from settings.json...
Loading stable-vicuna-13b...
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:07<00:00, 2.54s/it]
Traceback (most recent call last):
File "H:\llms\oobabooga\text-generation-webui\server.py", line 914, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "H:\llms\oobabooga\text-generation-webui\modules\models.py", line 89, in load_model
model = model.cuda()
File "H:\llms\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 905, in cuda
return self._apply(lambda t: t.cuda(device))
File "H:\llms\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
module._apply(fn)
File "H:\llms\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
module._apply(fn)
File "H:\llms\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
module._apply(fn)
[Previous line repeated 2 more times]
File "H:\llms\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 820, in _apply
param_applied = fn(param)
File "H:\llms\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 905, in <lambda>
return self._apply(lambda t: t.cuda(device))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 23.99 GiB total capacity; 22.99 GiB already allocated; 0 bytes free; 22.99 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
A 13B LLaMA model in 16-bit will not fit in 24 GB of VRAM. You need to either load it in 8-bit with `load_in_8bit` or use a GPTQ-quantized model, which are usually 4-bit.
@LaaZa Oh, really? I didn't realise this was an issue until 30B and up. Thanks.
Where do I use the 8-bit flag? And do you know if I can quantize the model myself? I'm using the stable vicuna deltas to create a new model from huggyllama/llama-13b.
EDIT: I realised there is a 'load in 8-bit' flag in the model settings in oobabooga. But this might not help people who can't get that far.
Just set the `--load-in-8bit` flag, or check that option in the web UI when you load the model.
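Under the hood this is the transformers + bitsandbytes 8-bit loading path. A minimal sketch, assuming a local model directory (the path below is illustrative; textgen wires all of this up for you when you pass the flag):

```python
# Rough sketch of what --load-in-8bit does when loading with
# transformers + bitsandbytes directly. The model path is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "models/stable-vicuna-13b"  # placeholder path

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",   # let accelerate place the layers
    load_in_8bit=True,   # quantize weights to int8 as they are loaded
)
```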
For GPTQ quantization you could use AutoGPTQ or GPTQ-for-LLaMa. Currently textgen uses the latter for inference, but models quantized with either will work. Personally I would use AutoGPTQ, and I'm also implementing it for textgen inference.
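If you do want to quantize a model yourself, a rough AutoGPTQ sketch looks something like the following. Treat it as an outline only: the paths and calibration text are placeholders, and you should check the AutoGPTQ README for the exact format it expects for calibration examples.

```python
# Rough AutoGPTQ outline; paths and calibration data are placeholders.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

pretrained = "models/stable-vicuna-13b"               # fp16 model to quantize
quantized_dir = "models/stable-vicuna-13b-4bit-128g"  # output directory

tokenizer = AutoTokenizer.from_pretrained(pretrained)
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)

model = AutoGPTQForCausalLM.from_pretrained(pretrained, quantize_config)

# Calibration samples: a handful of tokenized examples of text the model
# is expected to see. Real quantization runs use more (and more varied) data.
examples = [tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")]

model.quantize(examples)
model.save_quantized(quantized_dir)
tokenizer.save_pretrained(quantized_dir)
```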
That's great, thank you. I'll check them out for other models (this one is working well with the 8-bit flag set). Thanks for your help! :)
@FieldMarshallVague I'm able to load 13B with 24 GB of VRAM using the `--gpu-memory` flag. I append `--gpu-memory 21` and that fixes all of my memory allocation errors without reducing model accuracy.
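For anyone loading with transformers directly instead of through the web UI, I believe the rough equivalent is capping `max_memory` for the GPU and letting accelerate spill the rest to CPU RAM; the path and limits below are illustrative:

```python
# Rough equivalent of --gpu-memory 21 when loading with transformers
# directly: cap GPU 0 at ~21 GiB and offload the remainder to CPU RAM.
# Path and memory limits are illustrative.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "models/stable-vicuna-13b",
    device_map="auto",
    max_memory={0: "21GiB", "cpu": "64GiB"},
)
```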
@xNul Oh, interesting. I'm using the CarperAI stable vicuna 13b. It works fine with load-in-8bit mode, but throws errors during conversations otherwise. It looks like the model size should just about fit in the VRAM of a 4090, so maybe it's just a case of tweaking things...
Do you think the quantization actually causes some accuracy loss? I've seen conflicting arguments and have stopped worrying about it. But I can imagine how it might.
Setting `--gpu-memory` will offload any excess to RAM, which may make inference much slower. Honestly, the very tiny degradation from quantization is usually well worth the tradeoff.
I didn't realise it was offloading to RAM, but of course, it makes sense. Yeah, seems like the consensus is quantization is fine for most use cases.
> @xNul Oh, interesting. I'm using the CarperAI stable vicuna 13b. It works fine with load-in-8bit mode, but throws errors during conversations otherwise. It looks like the model size should just about fit in the VRAM of a 4090, so maybe it's just a case of tweaking things...
> Do you think the quantization actually causes some accuracy loss? I've seen conflicting arguments and have stopped worrying about it. But I can imagine how it might.
@FieldMarshallVague By definition, quantization reduces accuracy because you're reducing the precision of the model weights to fit the model into memory. When you throw away half your data (16bit -> 8bit), you lose accuracy. Granted, it is the least impactful half of your data so you don't lose much, but there is accuracy loss. Throwing away 75% of your data (16bit -> 4bit) is a much larger amount of accuracy loss.
I'm not sure how much accuracy is lost with 8-bit quantization or if it really matters, but I'm getting great results with unquantized vicuna-13b, and inference speed, despite offloading a few gigs, is still good with my setup (faster than reading speed while streaming the output). I think there is a significant accuracy reduction with 4-bit quantization though. Too much for many scenarios.
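To make the precision loss concrete, here's a toy numpy sketch of naive symmetric int8 quantization of a fake weight tensor. Real schemes like bitsandbytes' LLM.int8() and GPTQ are far smarter than this, so treat it purely as an illustration of why fewer bits means some information is thrown away:

```python
# Toy illustration of precision loss from quantization: naive symmetric
# int8 quantization of random "weights", then a round trip back to fp32.
# Real methods (LLM.int8(), GPTQ) are much smarter than a single scale.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)  # fake fp32 weights

scale = np.abs(w).max() / 127.0                           # one scale per tensor
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_restored = w_int8.astype(np.float32) * scale            # dequantize

print("max abs error: ", np.abs(w - w_restored).max())
print("mean abs error:", np.abs(w - w_restored).mean())
```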
Honestly though, if we take into account the immense reduction in memory requirements, the differences in perplexity scores are vanishingly small.
This is a comparison table from llama.cpp (a different quantization scheme to GPTQ, but with similar results):
| Model | Measure | F16 | Q4_0 | Q4_1 | Q4_2 | Q5_0 | Q5_1 | Q8_0 |
|---|---|---|---|---|---|---|---|---|
| 7B | perplexity | 5.9565 | 6.2103 | 6.1286 | 6.1698 | 6.0139 | 5.9934 | 5.9571 |
| 7B | file size | 13.0G | 4.0G | 4.8G | 4.0G | 4.4G | 4.8G | 7.1G |
| 7B | ms/tok @ 4th | 128 | 56 | 61 | 84 | 91 | 95 | 75 |
| 7B | ms/tok @ 8th | 128 | 47 | 55 | 48 | 53 | 59 | 75 |
| 7B | bits/weight | 16.0 | 5.0 | 6.0 | 5.0 | 5.5 | 6.0 | 9.0 |
| 13B | perplexity | 5.2455 | 5.3748 | 5.3471 | 5.3433 | 5.2768 | 5.2582 | 5.2458 |
| 13B | file size | 25.0G | 7.6G | 9.1G | 7.6G | 8.4G | 9.1G | 14G |
| 13B | ms/tok @ 4th | 239 | 104 | 113 | 160 | 176 | 185 | 141 |
| 13B | ms/tok @ 8th | 240 | 85 | 99 | 97 | 108 | 117 | 147 |
| 13B | bits/weight | 16.0 | 5.0 | 6.0 | 5.0 | 5.5 | 6.0 | 9.0 |
The effect is generally smaller for larger models, and the savings may allow you to use a model one tier larger instead, which will be better (assuming one is available).
These are very interesting points, thanks. It's occurred to me that if quantization is just ignoring the extra detail in the longer bit widths, then aren't we just looking at the left-most digits? And couldn't we therefore skip quantizing altogether and only read the bits we need, i.e. download a 16-bit model and read only 4 bits per weight into memory, rather than rewriting the model?
I guess the answer is that it would take WAY longer to stream into memory when you have to skip so many bits. But with NVMe drives it might only be 2-4 times longer...? Just a thought. That might, in fact, be what 'load-in-8bit' mode is doing! :D
I still don't have a good intuition for how much the quantization affects accuracy, but I guess it may depend on the parameters themselves (e.g. some more 'interconnected' ones will suffer greater loss, perhaps). I'm very new to this and haven't built one from the ground up. Fascinating stuff, though!
No, the quantization does a lot of smart things to minimize the negative impact. Every format allocates its bits differently (in terms of what they are used for), so we can't just "chop them off". Furthermore, 8-bit and lower quantizations are usually integers instead of floats. These methods are optimized for machine learning use.
A common alternative to standard 16-bit floats in machine learning is bfloat16, which uses a different bit allocation that actually aligns with 32-bit floats but with less precision: you can technically convert from 32-bit float to bfloat16 just by chopping off bits at the end, or pad with zeros when going the other way.
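As a quick illustration of that "chop off the end" relationship, here is a small numpy sketch that truncates an fp32 bit pattern down to bfloat16 precision (simple truncation, no rounding; the values are arbitrary):

```python
# Sketch of the fp32 <-> bfloat16 relationship: bfloat16 keeps the sign,
# the full 8-bit exponent, and only the top 7 mantissa bits of an fp32,
# so a crude conversion is just dropping the low 16 bits.
import numpy as np

x = np.array([3.14159265, -0.00123], dtype=np.float32)

bits = x.view(np.uint32)                    # reinterpret the fp32 bit patterns
bf16_bits = bits & np.uint32(0xFFFF0000)    # chop off the low 16 bits
x_truncated = bf16_bits.view(np.float32)    # zero-padded back to fp32

print(x, "->", x_truncated)                 # same exponent, coarser mantissa
```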
@LaaZa wow yeah, those differences are pretty minor. I'm not very familiar with perplexity though. Is it able to reflect model accuracy well?
@LaaZa Thanks, that's a very insightful explanation. So bfloats seem close to what I was thinking, but "it's complicated". I've seen these cropping up in various places, so thanks for making their purpose a bit clearer.
> @LaaZa wow yeah, those differences are pretty minor. I'm not very familiar with perplexity though. Is it able to reflect model accuracy well?
Perplexity measures how well the model can predict the next token in the test dataset. Smaller numbers are better, but obviously here we are comparing the differences. It is a bit hard to say exactly how a given difference affects the output, but the differences are so small that anything you "notice" in the text is likely placebo.
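In case it helps, perplexity is just the exponential of the average negative log-likelihood the model assigns to each token of a test set. A toy sketch with made-up per-token log-probabilities:

```python
# Toy perplexity calculation: exp of the average negative log-likelihood.
# The per-token log-probs are made up; in practice they come from the
# model's logits over a test corpus such as wikitext.
import math

token_log_probs = [-2.1, -0.3, -4.0, -1.2, -0.8]   # log p(token | context)
nll = -sum(token_log_probs) / len(token_log_probs)
perplexity = math.exp(nll)

print(perplexity)   # lower is better
```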
This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.