text-generation-webui

error while trying to start it

Open TheCodeInjection opened this issue 2 years ago • 19 comments

Describe the bug

When I run the start .bat file, I get this error.

Is there an existing issue for this?

  • [X] I have searched the existing issues

Reproduction

idk

Screenshot

No response

Logs

Gradio HTTP request redirected to localhost :)
bin C:\Users\areleh\Downloads\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117.dll
Loading gpt4-x-alpaca-13b-native-4bit-128g...
Found the following quantized model: models\gpt4-x-alpaca-13b-native-4bit-128g\gpt-x-alpaca-13b-native-4bit-128g-cuda.pt
Loading model ...
Done.
Traceback (most recent call last):
  File "C:\Users\areleh\Downloads\oobabooga_windows\oobabooga_windows\text-generation-webui\server.py", line 914, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "C:\Users\areleh\Downloads\oobabooga_windows\oobabooga_windows\text-generation-webui\modules\models.py", line 158, in load_model
    model = load_quantized(model_name)
  File "C:\Users\areleh\Downloads\oobabooga_windows\oobabooga_windows\text-generation-webui\modules\GPTQ_loader.py", line 197, in load_quantized
    model = model.to(torch.device('cuda:0'))
  File "C:\Users\areleh\Downloads\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\transformers\modeling_utils.py", line 1896, in to
    return super().to(*args, **kwargs)
  File "C:\Users\areleh\Downloads\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1145, in to
    return self._apply(convert)
  File "C:\Users\areleh\Downloads\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  File "C:\Users\areleh\Downloads\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  File "C:\Users\areleh\Downloads\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "C:\Users\areleh\Downloads\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 844, in _apply
    self._buffers[key] = fn(buf)
  File "C:\Users\areleh\Downloads\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 4.00 GiB total capacity; 3.38 GiB already allocated; 0 bytes free; 3.49 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Done!
Press any key to continue . . .

System Info

Device name	LT123
Processor	11th Gen Intel(R) Core(TM) i7-11370H @ 3.30GHz   3.30 GHz
Installed RAM	24.0 GB (23.7 GB usable)
Device ID	60B4CA00-2E64-4BDD-BA14-0A50677D5DFC
Product ID	00325-97208-78436-AAOEM
System type	64-bit operating system, x64-based processor

TheCodeInjection avatar May 01 '23 18:05 TheCodeInjection

If you only have 4 GB of VRAM you are never going to be able to load a 13B model onto GPU. Look into trying GGML models.

LaaZa avatar May 01 '23 18:05 LaaZa

How do I add VRAM?

TheCodeInjection avatar May 01 '23 18:05 TheCodeInjection

You don't. It depends on your GPU. But you can try GGML models, since they run on the CPU and use system RAM. It's not going to be fast, though.
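
Roughly, running a GGML model on the CPU looks something like this with the llama-cpp-python bindings (just a sketch; the model path and settings here are made up, and textgen has its own llama.cpp loader as well):

```python
# Sketch: running a GGML-format model on the CPU via llama-cpp-python.
# The model path is hypothetical; point it at any GGML LLaMA checkpoint.
from llama_cpp import Llama

llm = Llama(
    model_path="models/ggml-model-q4_0.bin",  # hypothetical file name
    n_ctx=2048,    # context window
    n_threads=8,   # CPU threads; tune to your machine
)

out = llm("Q: What is the capital of France? A:", max_tokens=32)
print(out["choices"][0]["text"])
```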

LaaZa avatar May 01 '23 18:05 LaaZa

I have this same issue, but I have a 4090 and 64 GB of RAM. I've tried setting the PyTorch max_split_size_mb to 512 in my OS environment variables (set 'PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512' on Windows). I also set a paging file on the same disk as the oobabooga install. No change to the error.
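
For reference, the variable has to be in the environment before CUDA is first used or it is ignored; from Python that would look roughly like this (just a sketch, and note that max_split_size_mb only mitigates fragmentation, it can't make a model fit that is simply too large):

```python
# Sketch: PYTORCH_CUDA_ALLOC_CONF must be set before CUDA is first used,
# otherwise the allocator never sees it. It only helps with fragmentation.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch  # imported after the variable is set

print(torch.cuda.get_device_properties(0).total_memory)  # sanity check the GPU
```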

PS H:\llms\oobabooga> .\start-webui.bat
Starting the web UI...
Gradio HTTP request redirected to localhost :)
Loading settings from settings.json...
Loading stable-vicuna-13b...
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:07<00:00,  2.54s/it]
Traceback (most recent call last):
  File "H:\llms\oobabooga\text-generation-webui\server.py", line 914, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "H:\llms\oobabooga\text-generation-webui\modules\models.py", line 89, in load_model
    model = model.cuda()
  File "H:\llms\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 905, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "H:\llms\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  File "H:\llms\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  File "H:\llms\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "H:\llms\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 820, in _apply
    param_applied = fn(param)
  File "H:\llms\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 905, in <lambda>
    return self._apply(lambda t: t.cuda(device))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 23.99 GiB total capacity; 22.99 GiB already allocated; 0 bytes free; 22.99 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

FieldMarshallVague avatar May 01 '23 19:05 FieldMarshallVague

An unquantized 13B LLaMA model will not fit in 24 GB of VRAM. You need to either load it in 8-bit with load_in_8bit or use a GPTQ-quantized model, which are usually 4-bit.

LaaZa avatar May 01 '23 20:05 LaaZa

@LaaZa Oh, really? I didn't realise this was an issue until 30B and up. Thanks.

Where do I use the 8-bit flag? And do you know if I can quantize the model myself? I'm using the stable vicuna deltas to create a new model from the huggyllama/llama-13b.

EDIT: I realised there is a 'load in 8-bit' flag in the model settings in Oobabooga. But this might not help people who can't get that far.

FieldMarshallVague avatar May 02 '23 08:05 FieldMarshallVague

Just set the --load-in-8bit flag, or check that option in the web UI when you load the model.

For GPTQ quantization you could use AutoGPTQ or GPTQ-for-LLaMa. Textgen currently uses the latter for inference, but either will work. Personally I would use AutoGPTQ, and I'm also implementing it for textgen inference.
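
If you load through transformers directly instead of the webui flag, 8-bit loading looks roughly like this (a sketch only; the checkpoint name is just an example and bitsandbytes needs to be installed):

```python
# Sketch: loading a 13B model in 8-bit with transformers + bitsandbytes.
# The checkpoint name is only an example; substitute your own model.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "huggyllama/llama-13b"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    load_in_8bit=True,   # weights quantized to int8 on load
    device_map="auto",   # let accelerate place layers on GPU/CPU
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```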

LaaZa avatar May 02 '23 09:05 LaaZa

That's great, thank you. I'll check them out for other models (this one is working well with the 8-bit flag set). Thanks for your help! :)

FieldMarshallVague avatar May 02 '23 09:05 FieldMarshallVague

@FieldMarshallVague I'm able to load 13B with 24 GB of VRAM using the --gpu-memory flag. I append --gpu-memory 21 and that fixes all of my memory allocation errors without reducing model accuracy.
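
As I understand it, --gpu-memory 21 roughly corresponds to a per-device memory cap at the transformers/accelerate level, something like this sketch (not the exact textgen code; the CPU figure is just an example):

```python
# Sketch: cap GPU 0 at ~21 GiB and let accelerate offload the rest to RAM.
# This approximates what a --gpu-memory style limit does, not textgen's code.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-13b",                    # example checkpoint from this thread
    device_map="auto",
    max_memory={0: "21GiB", "cpu": "64GiB"},   # GPU cap + CPU RAM budget
    torch_dtype="auto",
)
```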

xNul avatar May 02 '23 18:05 xNul

@xNul Oh, interesting. I'm using the CarperAI stable vicuna 13b. It works fine with load-in-8bit mode, but throws errors during conversations otherwise. It looks like the model size should just about fit in the VRAM of a 4090, so maybe it's just a case of tweaking things...

Do you think the quantization actually causes some accuracy loss? I've seen conflicting arguments and have stopped worrying about it. But I can imagine how it might.

FieldMarshallVague avatar May 02 '23 20:05 FieldMarshallVague

Setting gpu-memory will offload any excess to RAM, which may make inference much slower. Honestly, the very tiny degradation from quantization is usually well worth the tradeoff.

LaaZa avatar May 02 '23 21:05 LaaZa

I didn't realise it was offloading to RAM, but of course, it makes sense. Yeah, seems like the consensus is quantization is fine for most use cases.

FieldMarshallVague avatar May 02 '23 22:05 FieldMarshallVague

> @xNul Oh, interesting. I'm using the CarperAI stable vicuna 13b. It works fine with load-in-8bit mode, but throws errors during conversations otherwise. It looks like the model size should just about fit in the VRAM of a 4090, so maybe it's just a case of tweaking things...
>
> Do you think the quantization actually causes some accuracy loss? I've seen conflicting arguments and have stopped worrying about it. But I can imagine how it might.

@FieldMarshallVague By definition, quantization reduces accuracy because you're reducing the precision of the model weights to fit the model into memory. When you throw away half your data (16bit -> 8bit), you lose accuracy. Granted, it is the least impactful half of your data so you don't lose much, but there is accuracy loss. Throwing away 75% of your data (16bit -> 4bit) is a much larger amount of accuracy loss.

I'm not sure how much accuracy is lost with 8-bit quantization or if it really matters, but I'm getting great results with unquantized vicuna-13b, and inference speed, despite offloading a few gigs, is still good with my setup (faster than reading speed while streaming the output). I think there is a significant accuracy reduction with 4-bit quantization though. Too much for many scenarios.
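
To make the loss concrete, here's a toy absmax round-trip to int8 on random weights; real schemes like bitsandbytes or GPTQ are much smarter than this, so treat it purely as an illustration:

```python
# Toy illustration of quantization error: absmax round-trip to int8.
# Real 8-bit/4-bit schemes (bitsandbytes, GPTQ) do much better than this.
import torch

w = torch.randn(4096, dtype=torch.float32)           # pretend these are weights
scale = w.abs().max() / 127                           # one scale for the whole tensor
w_int8 = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
w_back = w_int8.float() * scale                       # dequantize

print("mean abs error:", (w - w_back).abs().mean().item())
print("max abs error: ", (w - w_back).abs().max().item())
```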

xNul avatar May 03 '23 04:05 xNul

Honestly though, if we take into account the immense reduction in memory requirements, the differences in perplexity scores are insignificant.

This is a comparison table from llama.cpp (a different quantization method than GPTQ, but with similar results):

| Model | Measure      | F16    | Q4_0   | Q4_1   | Q4_2   | Q5_0   | Q5_1   | Q8_0   |
|-------|--------------|--------|--------|--------|--------|--------|--------|--------|
| 7B    | perplexity   | 5.9565 | 6.2103 | 6.1286 | 6.1698 | 6.0139 | 5.9934 | 5.9571 |
| 7B    | file size    | 13.0G  | 4.0G   | 4.8G   | 4.0G   | 4.4G   | 4.8G   | 7.1G   |
| 7B    | ms/tok @ 4th | 128    | 56     | 61     | 84     | 91     | 95     | 75     |
| 7B    | ms/tok @ 8th | 128    | 47     | 55     | 48     | 53     | 59     | 75     |
| 7B    | bits/weight  | 16.0   | 5.0    | 6.0    | 5.0    | 5.5    | 6.0    | 9.0    |
| 13B   | perplexity   | 5.2455 | 5.3748 | 5.3471 | 5.3433 | 5.2768 | 5.2582 | 5.2458 |
| 13B   | file size    | 25.0G  | 7.6G   | 9.1G   | 7.6G   | 8.4G   | 9.1G   | 14G    |
| 13B   | ms/tok @ 4th | 239    | 104    | 113    | 160    | 176    | 185    | 141    |
| 13B   | ms/tok @ 8th | 240    | 85     | 99     | 97     | 108    | 117    | 147    |
| 13B   | bits/weight  | 16.0   | 5.0    | 6.0    | 5.0    | 5.5    | 6.0    | 9.0    |

The effect is generally smaller for larger models, and the savings may allow you to use a model one tier larger instead, which will be better (assuming one is available).

LaaZa avatar May 03 '23 08:05 LaaZa

These are very interesting points, thanks. It's occurred to me that if quantization is just ignoring the extra detail in the longer byte lengths, then aren't we just looking at the left-most digits? And couldn't we therefore just not quantize things and only read the bits we need? i.e. download a 16-bit model and only read 4 bits of each weight into memory, rather than re-write the model?

I guess the answer is that it would take WAY longer to stream into memory when you have to skip so many bits. But with NVMe drives it might only be 2-4 times longer...? Just a thought. That might, in fact, be what 'load-in-8bit' mode is doing! :D

I still don't have a good intuition for how much the quantization affects accuracy, but I guess it may depend on the parameters themselves (e.g. some more 'interconnected' ones will suffer greater loss, perhaps). I'm very new to this and haven't built one from the ground up. Fascinating stuff, though!

FieldMarshallVague avatar May 03 '23 09:05 FieldMarshallVague

No, the quantization does a lot of smart things to minimize the negative impact. Every format allocates its bits differently (in terms of what they are used for), so we can't just "chop them off". Furthermore, 8-bit and lower quantizations are usually integers instead of floats. These methods are optimized for machine learning use.

A common alternative for 16-bit floats in machine learning is bfloat16, which uses a different bit allocation that aligns with 32-bit floats but has less precision. Values can technically be converted from 32-bit to bfloat16 just by chopping off bits at the end, and back the other way by padding with zeros.
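
You can see the relationship directly in PyTorch; note the cast rounds to nearest rather than purely truncating, so the low bits can differ slightly:

```python
# Sketch: bfloat16 is (roughly) the top 16 bits of a float32.
# PyTorch rounds to nearest on the cast, so it is not a pure truncation.
import torch

x = torch.tensor([3.1415927], dtype=torch.float32)
bf = x.to(torch.bfloat16)

bits32 = x.view(torch.int32).item() & 0xFFFFFFFF   # reinterpret the raw bits
bits16 = bf.view(torch.int16).item() & 0xFFFF

print(f"float32 bits:  {bits32:032b}")
print(f"bfloat16 bits: {bits16:016b}  (compare with the leading 16 bits above)")
print("round-trip value:", bf.to(torch.float32).item())
```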

LaaZa avatar May 03 '23 10:05 LaaZa

@LaaZa wow yeah, those differences are pretty minor. I'm not very familiar with perplexity though. Is it able to reflect model accuracy well?

xNul avatar May 03 '23 14:05 xNul

@LaaZa Thanks, that's a very insightful explanation. So bfloats seem close to what I was thinking, but "it's complicated". I've seen these cropping up in various places, so thanks for making their purpose a bit clearer.

FieldMarshallVague avatar May 03 '23 14:05 FieldMarshallVague

> @LaaZa wow yeah, those differences are pretty minor. I'm not very familiar with perplexity though. Is it able to reflect model accuracy well?

It measures how well the model can predict the next word in the test dataset. Smaller numbers are better, but obviously what matters here is the difference between them. It is a bit hard to say exactly how a given difference affects the output, but the differences are so small that any change in the text you "notice" is likely placebo.
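
Perplexity is just the exponential of the average per-token cross-entropy loss; a minimal sketch of measuring it (gpt2 only because it is small enough to run anywhere):

```python
# Sketch: perplexity = exp(mean negative log-likelihood per token).
# gpt2 is used purely because it is tiny; any causal LM works the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

text = "The quick brown fox jumps over the lazy dog."
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    loss = model(ids, labels=ids).loss   # mean cross-entropy over the tokens

print("perplexity:", torch.exp(loss).item())
```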

LaaZa avatar May 03 '23 14:05 LaaZa

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.

github-actions[bot] avatar Jun 02 '23 23:06 github-actions[bot]