text-generation-webui
Support models larger than RAM / VRAM
Description
It should be possible to execute models larger than the RAM available on the GPU.
Can oobabooga implement this or an analogue of this?
Thanks in advance! Your project is extremely cool. I hope to continue pushing new PRs as I continue to discover more and more about this project.
EDIT: this ticket is more or less about either documenting this facility in a prominent place, or making one of these modes the default and document how to disable or change mode.
Offloading already works in this repo. With deepspeed, flexgen, and accelerate.
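For context on what accelerate-style offloading does conceptually: layers are assigned to the GPU until its memory budget is full, then to CPU RAM, then spilled to disk. Here is a toy sketch of that greedy placement; the budgets, layer names, and function are all made up for illustration and are not the real accelerate API.

```python
# Toy sketch of accelerate-style auto device placement: greedily fill
# the GPU budget, then CPU RAM, then spill the rest to disk.
# Budgets and layer sizes are made-up numbers, not a real API.
def assign_devices(layer_sizes, gpu_budget, cpu_budget):
    placement, gpu_used, cpu_used = {}, 0, 0
    for name, size in layer_sizes.items():
        if gpu_used + size <= gpu_budget:
            placement[name] = "gpu"
            gpu_used += size
        elif cpu_used + size <= cpu_budget:
            placement[name] = "cpu"
            cpu_used += size
        else:
            placement[name] = "disk"  # offloaded, fetched on demand
    return placement

layers = {"embed": 2, "block0": 4, "block1": 4, "head": 2}
print(assign_devices(layers, gpu_budget=6, cpu_budget=4))
# → {'embed': 'gpu', 'block0': 'gpu', 'block1': 'cpu', 'head': 'disk'}
```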
Does accelerate (--auto-devices) use lazy loading for big models that are larger than RAM, like llama.cpp with mmap?
I've tried to use it, but server.py exits, probably because of OOM. No errors, btw.
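The mmap trick mentioned above can be demonstrated with Python's standard library: the file is mapped into the address space and the OS pages data in on demand, so a file larger than RAM can still be opened and read from instantly. The file name and size below are invented for illustration; this is the general mechanism llama.cpp relies on, not its actual code.

```python
import mmap
import os
import tempfile

# Create a sparse 16 MiB file standing in for a large weights file.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.seek(16 * 1024 * 1024 - 1)
    f.write(b"\x42")

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Only the pages actually touched are faulted into RAM,
    # not the whole file.
    last_byte = mm[-1]
    mm.close()

print(last_byte)  # → 66 (0x42)
```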
It probably depends on how you sharded the model too. If it's in only 3 chunks, yeah... it might be a problem.
Oh, it's clear now: my model is not sharded at all. I thought that --auto-devices or deepspeed loaded models using some "magic" under the hood.
BTW, I'm using quantized models, which are not sharded by default. Does that mean it's a no-go with deepspeed and accelerate?
Quantized will load in accelerate but not deepspeed.
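To illustrate why sharding matters for loading: each shard is deserialized on its own, so peak host memory during loading is roughly one shard rather than the whole checkpoint. The toy index format below loosely mirrors the idea behind Hugging Face's `*.index.json` sharded checkpoints, but the functions and file names are invented for illustration, not a real API.

```python
import json
import os
import pickle
import tempfile

def save_sharded(state, shard_keys, outdir):
    """Write each group of keys to its own shard file plus an index."""
    index = {}
    for i, keys in enumerate(shard_keys):
        fname = f"shard-{i}.pkl"
        with open(os.path.join(outdir, fname), "wb") as f:
            pickle.dump({k: state[k] for k in keys}, f)
        for k in keys:
            index[k] = fname
    with open(os.path.join(outdir, "index.json"), "w") as f:
        json.dump(index, f)

def load_sharded(outdir):
    """Rebuild the state dict one shard at a time."""
    with open(os.path.join(outdir, "index.json")) as f:
        index = json.load(f)
    merged = {}
    for fname in sorted(set(index.values())):
        with open(os.path.join(outdir, fname), "rb") as f:
            merged.update(pickle.load(f))  # one shard in memory at a time
    return merged

d = tempfile.mkdtemp()
state = {"w1": [1.0] * 4, "w2": [2.0] * 4, "w3": [3.0] * 4}
save_sharded(state, [["w1", "w2"], ["w3"]], d)
print(load_sharded(d) == state)  # → True
```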
Ok, I'll give it one more try. Somehow accelerate suddenly stopped working: I don't see the usual message saying how much VRAM/RAM accelerate assigned.
python server.py --auto-devices --share --wbits 4 --groupsize 128 --model_type llama --listen --model xxxxxx-safetensor
Trying to load a model which is larger than RAM, and server.py exits because of OOM.
If even the quantized model is larger than your RAM, you will have to make a big swap file and load it that way. You can also use pre-layer for llama.
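For anyone following along, a swap file can be set up like this on Linux. This is a system-configuration sketch; the 64 GiB size is an example only, so pick something larger than the amount your model overflows RAM by.

```shell
# Create and enable a 64 GiB swap file (requires root).
sudo fallocate -l 64G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Verify it is active.
swapon --show
```

Note that loading through swap is slow, since the checkpoint gets paged out to disk and back during loading.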
Note: at least on my machine, DeepSpeed is no faster than --auto-devices, with or without --8bits.
I'm seeing the same behavior. Big safetensor file (MetalX/GPT4-X-Alpaca-30B-4bit), relevant settings like @mrdc's previous comment, crash with no error message in model.load_state_dict. I have 32GB of CPU RAM, 24GB of VRAM, and a big swap file of 128GB. It should work, not sure why it isn't...
Weirdly, if using 0cc4m's KoboldAI latestgptq branch (https://github.com/0cc4m/KoboldAI) I can load a model without running out of RAM or paging massive amounts to swap. But if I try to load the same file in Ooba, it errors out after paging out 140GB of swap.
This is using the same conda environment and manually running either one.
The model I'm trying to load is a safetensor. It doesn't matter if it's the 30B GPT4-X or the newest openassist 30B.
Oh, and I'm using 0cc4m's GPTQ branch in both (and I have tested with Ooba's GPTQ branch and it doesn't make a difference).
Weirdly, there is a spike in swap usage before line 47 (model = model.eval()) in GPTQ_Loader.py; however, nothing appears to be written to swap, as my SSD doesn't start writing until the model reaches the following "loading model..." step.
I'm not good enough to figure out what the difference is, but it is proof that large models can be loaded without blowing out swap.
Kobold uses lazy loading.
Could we have an mmap option? On my laptop there is a 13B model that llama.cpp can run, but ooba cannot, because I run out of RAM (not VRAM) and get OOM'd.
I thought that was a function of the cpp Python module.
@Ph0rk0z yes it is, but it would be nice to be able to use it in the non-cpp version as well. It should be possible to copy the model directly to VRAM without first having to load it fully into RAM. It kinda sucks for low-RAM, high-VRAM systems.
I assume the person who wrote the wrapper has to update it. I actually want to turn mmap off and see if I get faster speeds loading the whole thing into RAM.
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.