Improve memory management

Open Dampfinchen opened this issue 1 year ago • 18 comments

Hello,

I've noticed that memory management in Oobabooga is quite poor compared to KoboldAI and Tavern. Here are some tests I've done:

KoboldAI + Tavern: running Pygmalion 6B in FP16 with 6 layers on my 6 GB RTX 2060, a context size of 1230, and a character prompt of 600 tokens, I can chat for hours without a single out-of-memory error. It works flawlessly.

Oobabooga: default --auto-devices, context size set to 1230 as well, same character prompt: immediately out of memory. Trying the standard character, which has a very small character prompt: after a few messages I get out of memory, so I guess the context size would have to be set below 100 tokens. It's that bad.

This is on Windows btw.

I've even tried setting --gpu-memory to 6, but then I can't even load the model without getting OOM. Lowering the context size even further, to more sensible levels, doesn't really help either.

For Oobabooga to become the Automatic1111 of text generation UIs, memory management needs an overhaul imo, because with more capable language models it's completely unusable right now on mainstream hardware.

Dampfinchen avatar Mar 19 '23 17:03 Dampfinchen

6 is your entire GPU, leave some room for the browser and windows/xorg/etc.

Ph0rk0z avatar Mar 19 '23 17:03 Ph0rk0z

6 is your entire GPU, leave some room for the browser and windows/xorg/etc.

I know. I only tried 6 GB because 5 GB (which is what --auto-devices sets it to) shows the issues I demonstrated in my post.

By the way, with KoboldAI I can have a decent number of browser tabs open and never get OOM. Even with nothing else open at all, Oobabooga runs out of memory very quickly with the standard character (usually around 3 messages).

Dampfinchen avatar Mar 19 '23 17:03 Dampfinchen

Set it to 4. It doesn't come fine-tuned, I think; you have to do it all yourself. I had issues running Pygmalion 6B on 4 GB until I set it to 2 and lowered the max tokens and new tokens settings.

Try lowering max tokens to 700 and new tokens to 100.

BarfingLemurs avatar Mar 19 '23 17:03 BarfingLemurs

Set it to 4. It doesn't come fine-tuned, I think; you have to do it all yourself. I had issues running Pygmalion 6B on 4 GB until I set it to 2 and lowered the max tokens and new tokens settings.

Try lowering max tokens to 700 and new tokens to 100.

I think I tried 4 too and it wasn't working well. And if you have to lower the context size by that much (which I did btw, and that wasn't working either), it's really not worth it when I can run 6B with KoboldAI at a context size of 1230.

Dampfinchen avatar Mar 19 '23 17:03 Dampfinchen

Along with the other advice, you can try using a custom launcher script to set a more aggressive VRAM management scheme. This Linux bash example allocates in 128 MB chunks and starts trying to reclaim memory once over 50% of VRAM is utilized:

#!/bin/bash
export PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.5,max_split_size_mb:128
python server.py
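
If editing a shell launcher isn't convenient (for example on Windows), the same allocator settings can in principle be applied from Python before PyTorch touches CUDA; a minimal sketch, not something the webui does for you:

import os

# Must be set before PyTorch initializes its CUDA caching allocator,
# so do it before importing torch.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "garbage_collection_threshold:0.5,max_split_size_mb:128"
)

import torch  # imported after setting the env var on purpose

x = torch.zeros(1, device="cuda")  # the first allocation picks up the config
print(torch.cuda.memory_reserved() / 2**20, "MiB reserved")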

noxiouscardiumdimidium avatar Mar 19 '23 17:03 noxiouscardiumdimidium

I have heard this "oobabooga can't do memory management" meme quite a few times, usually from users who try a single value of --gpu-memory, get an error, and then for reasons unknown don't try reducing the number like the guide instructs https://github.com/oobabooga/text-generation-webui/wiki/Low-VRAM-guide

oobabooga avatar Mar 19 '23 17:03 oobabooga

I have heard this "oobabooga can't do memory management" meme quite a few times, usually from users who try a single value of --gpu-memory, get an error, and then for reasons unknown don't try reducing the number like the guide instructs https://github.com/oobabooga/text-generation-webui/wiki/Low-VRAM-guide

I was using --auto-devices without any --gpu-memory X attached when conducting these tests (in the command line it tells me it was setting --gpu-memory 5). So basically Ooba out of the box.

I also tried setting it to 4, with subpar results. However, I was using it in place of --auto-devices rather than together with it; maybe that's a mistake on my part. I will check with --auto-devices and --gpu-memory 4 soon and report back if that fixes it.

Dampfinchen avatar Mar 19 '23 17:03 Dampfinchen

I have heard this "oobabooga can't do memory management" meme quite a few times, usually from users who try a single value of --gpu-memory, get an error, and then for reasons unknown don't try reducing the number like the guide instructs https://github.com/oobabooga/text-generation-webui/wiki/Low-VRAM-guide

Alright. So I've tested --auto-devices and --gpu-memory 4. Still out of memory immediately.

However, I am now using --auto-devices and --gpu-memory 3 and it appears to be working fine. At 0.7 tokens/s it's a lot slower than KoboldAI though (1.1 tokens/s).

Still, with --gpu-memory set to 3 it works a lot better than before. I guess instead of treating the number as your total video memory, you have to think of it as your video memory minus the amount needed for your context. I was avoiding setting it to 3 previously because I thought there was no way I was going to waste half my video memory; now I know better, because the video memory needed for the context has to be taken into account as well.
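
For a rough sense of the numbers, here is a back-of-envelope sketch (assuming Pygmalion 6B follows the GPT-J-6B layout of 28 layers and a 4096 hidden size, with an fp16 cache), not an exact measurement:

n_layers, d_model, bytes_fp16 = 28, 4096, 2
ctx = 1230

# K and V are each cached per layer, per token
kv_cache_bytes = 2 * n_layers * ctx * d_model * bytes_fp16
print(round(kv_cache_bytes / 2**20), "MiB for the KV cache alone")  # ~538 MiB

# Add activations, the CUDA context, and allocator fragmentation, and budgeting
# 1-2 GB below the physical 6 GB for --gpu-memory starts to look reasonable.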

I will update this with long-term information, as this is just a first impression, so be sure to check back from time to time! Thank you for your good work.

Dampfinchen avatar Mar 19 '23 18:03 Dampfinchen

Can we use a value like 3.5 for this? I only tried whole numbers, but it sticks to the limit when I put --gpu-memory 20 or 22. It used to go over before llama was introduced.

And have you used nvtop, or GPU-Z on Windows, to see the actual memory use? The console also tells you how much you went over.

You can save some memory by disabling visual effects and browser HW acceleration too.

Ph0rk0z avatar Mar 19 '23 18:03 Ph0rk0z

I have added two options for finer VRAM control:

  1. --gpu-memory with explicit units (as @Ph0rk0z suggested). This now works: --gpu-memory 3457MiB
  2. --no-cache. This reduces VRAM usage a bit while generating text. It has a performance cost, but it may allow you to set a higher value for --gpu-memory resulting in a net gain.

https://github.com/oobabooga/text-generation-webui/wiki/Low-VRAM-guide#split-the-model-across-your-gpu-and-cpu
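
For reference, --gpu-memory should correspond roughly to a per-device max_memory cap for transformers' device_map="auto" loading; a sketch of the equivalent call (the model path and CPU limit are placeholders):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "models/pygmalion-6b",                       # hypothetical local path
    device_map="auto",                           # let accelerate split the model
    max_memory={0: "3457MiB", "cpu": "16GiB"},   # cap GPU 0, overflow to CPU RAM
    torch_dtype=torch.float16,
)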

oobabooga avatar Mar 19 '23 22:03 oobabooga

I have added two options for finer VRAM control:

  1. --gpu-memory with explicit units (as @Ph0rk0z suggested). This now works: --gpu-memory 3457MiB
  2. --no-cache. This reduces VRAM usage a bit while generating text. It has a performance cost, but it may allow you to set a higher value for --gpu-memory resulting in a net gain.

https://github.com/oobabooga/text-generation-webui/wiki/Low-VRAM-guide#split-the-model-across-your-gpu-and-cpu

Thanks a lot for reacting to feedback that fast. It is appreciated!

Here are my new tests, this time I'm measuring performance:

Settings were identical.

  1. --gpu-memory set to 3, example character with cleared context, context size 1230, four messages back and forth: 0.85 tokens/second.
  2. --gpu-memory set to 3450MiB (basically the highest value I'm allowed to use with bots that have a chat history + description of 1230 tokens), example character with cleared context, four messages back and forth: 0.87 tokens/second.
  3. --gpu-memory set to 4 but with --no-cache: 0.76 tokens/second.

Hmm, it doesn't seem to do much for performance; the differences are within the margin of error. You would expect tokens/s to be noticeably higher with more layers dedicated to the GPU. One thing I did notice, though, is that shared memory is not used at all.

[screenshot: GPU memory usage, showing shared memory untouched]

In KoboldAI, where I get around 1.1 tokens per second under the same conditions (for a fair comparison, token streaming has been disabled in Ooba with --no-stream), shared memory is completely full as well, which is not the case here. I know shared memory is basically just RAM, but perhaps there's some Windows trickery going on that shuffles data between VRAM and RAM and isn't being used in Ooba.

If you want me to run some tests, let me know. I will be glad to help you optimize Ooba even further.

Dampfinchen avatar Mar 20 '23 09:03 Dampfinchen

Make sure to clear the context and use the exact same prompts/settings, preferably in a mode where you get the exact same response back, i.e. disable do_sample. Otherwise it gets really hard to compare benchmarks and it's easy to trick yourself.
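
For example, with a plain transformers model, greedy decoding gives a repeatable answer to time against; a minimal sketch (the model path and prompt are placeholders):

import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "models/pygmalion-6b"  # hypothetical local path
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, device_map="auto", torch_dtype=torch.float16)

prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.time()
output = model.generate(**inputs, do_sample=False, max_new_tokens=100)  # greedy: same output every run
elapsed = time.time() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.2f} tokens/s")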

Ph0rk0z avatar Mar 20 '23 12:03 Ph0rk0z

I will say this: on Windows I've never gotten Ooba to offload anything to system RAM or even shared system RAM without using DeepSpeed (despite the agony that is getting DeepSpeed running on WSL2).

For a model that you can fit entirely in VRAM, memory management is really good. But something's afoot when it comes to splitting models: in my experience on Windows, I've never seen it touch shared memory. And considering I have a total of 16 GB of shared memory, that's quite a difference (there's, uh... a lot of models that will fit in 24 GB but not 8 GB).

LTSarc avatar Mar 21 '23 02:03 LTSarc

Isn't "shared GPU memory" on Windows exactly the same thing as the CPU offloading that is achieved with --auto-devices --gpu-memory 4000MiB?

oobabooga avatar Mar 21 '23 02:03 oobabooga

It should be, but I've never seen the model loaded into system RAM in Task Manager. And due to the way memory sharing is supposed to work, you should be able to just say "load it all into VRAM", and when VRAM runs out, Windows automagically puts the rest in the shared portion of system RAM.

E.g. if you set textures too high in a game you'll see pop-in and artifacts due to the speed penalty, but the program doesn't just faceplant with an "out of memory" error (unless you actually run out of system memory).

I am not sure how much of this is on the webui, how much is on the various libraries, and how much is on the Windows memory management stack.

LTSarc avatar Mar 21 '23 03:03 LTSarc

Further update on my end: even on Linux this happens. I was using --auto-devices and --gpu-memory 6.

[screenshot: CUDA out-of-memory error showing PyTorch's allocation]

Not only is the thing faceplanting with an OOM error when the VRAM runs out (the amount allocated to PyTorch exactly matches my VRAM minus overhead) instead of using system memory or, god forbid, the page file, it also seems to be completely ignoring the VRAM limit set by --gpu-memory. All of the Python code for this program is a few hundred KB; it's not other bits of the program taking up PyTorch's memory allocation.

I am pretty sure this shouldn't be happening with auto-devices and a gpu-memory flag, even if it has to slow to a crawl to use other memory.
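
One way to check what PyTorch itself thinks it is holding, versus what the driver reports, is to drop a few lines like these into server.py right after the model loads (a diagnostic sketch only; if the model was loaded with device_map="auto", model.hf_device_map would also show which layers went to the GPU versus "cpu"):

import torch

free, total = torch.cuda.mem_get_info()  # what the driver reports for the whole GPU
print((total - free) / 2**20, "MiB in use on the GPU overall")
print(torch.cuda.memory_allocated() / 2**20, "MiB held by live PyTorch tensors")
print(torch.cuda.memory_reserved() / 2**20, "MiB reserved by PyTorch's caching allocator")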

LTSarc avatar Mar 21 '23 05:03 LTSarc

I will say this: on Windows I've never gotten Ooba to offload anything to system RAM or even shared system RAM without using DeepSpeed (despite the agony that is getting DeepSpeed running on WSL2).

For a model that you can fit entirely in VRAM, memory management is really good. But something's afoot when it comes to splitting models: in my experience on Windows, I've never seen it touch shared memory. And considering I have a total of 16 GB of shared memory, that's quite a difference (there's, uh... a lot of models that will fit in 24 GB but not 8 GB).

Yup. For comparison's sake, here's what 6 GPU layers look like when Pygmalion 6B has just been loaded in KoboldAI:

[screenshot: GPU memory usage after loading the model in KoboldAI, with shared memory in use]

So with a full context size of 1230, I'm getting 1.08 tokens/s when the VRAM is close to full in KoboldAI (5.8 GB / 6 GB).

It appears it's handled a bit differently than in games: shared memory is used fully, even before the VRAM fills up.

Dampfinchen avatar Mar 21 '23 12:03 Dampfinchen

What if I have high VRAM but low RAM?

alkeryn avatar Mar 23 '23 11:03 alkeryn

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.

github-actions[bot] avatar Apr 22 '23 23:04 github-actions[bot]

I have heard this "oobabooga can't do memory management" meme quite a few times, usually from users who try a single value of --gpu-memory, get an error, and then for reasons unknown don't try reducing the number like the guide instructs https://github.com/oobabooga/text-generation-webui/wiki/Low-VRAM-guide

Yeah, but the issue is that llama.cpp and llamacpp_HF both somehow allocate more memory than they ever use. I have two 16 GB sticks of RAM, and when loading a model (phind-codellama-34b-v2.Q4_K_M) it allocates around 22 GB while using just 8 GB (in the console it also states that it requires around 10 GB in total). It also never releases memory during generation, which leads to it leaking even more.

All this started after I ran the update .bat for Windows, which gave me some git conflicts that I had to fix on my own. After that, it started allocating around 3x more memory than it actually NEEDS for some odd reason.

I made a separate .bat file to launch Gradio much quicker, here:

@echo off
call installer_files\conda\condabin\conda.bat activate installer_files\env
python server.py --xformers --rwkv-cuda-on --auto-launch --auto-devices --gpu-memory 10 --disk

Also, when a model is done loading, it starts leaking 1 MB/s, which is also not nice. I have no idea what the cause of this is, whether it's the Python API that wraps llama.cpp or llama.cpp itself, but it just makes text generation unusable.

Trying to use layer offloading doesn't help either: it still allocates 3x more memory than it actually uses, but this time generation takes much longer and it crashes much later (it still never even gets to the point of outputting the first token).
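
One thing that might be worth ruling out here: llama.cpp memory-maps the model file by default, so the "allocated"/committed figure a task manager shows can be much larger than what is actually resident in RAM. A minimal llama-cpp-python sketch of the knobs involved (the path and numbers are placeholders, and this is not a claim about what the webui passes internally):

from llama_cpp import Llama

llm = Llama(
    model_path="models/phind-codellama-34b-v2.Q4_K_M.gguf",  # hypothetical local path
    n_ctx=4096,
    n_gpu_layers=20,   # layers offloaded to VRAM
    use_mmap=True,     # map the file instead of copying it all into RAM
    use_mlock=False,   # don't pin the mapping; let the OS page it in and out
)
print(llm("Hello", max_tokens=16)["choices"][0]["text"])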

alexlnkp avatar Sep 30 '23 13:09 alexlnkp

Try the matmul kernels, they use less VRAM.

Ph0rk0z avatar Sep 30 '23 14:09 Ph0rk0z