
Support models larger than RAM / VRAM

Rudd-O opened this issue 1 year ago · 16 comments

Description

Existing documentation indicates it should be possible to execute models larger than the RAM available on the GPU.

Can oobabooga implement this or an analogue of this?

Thanks in advance! Your project is extremely cool. I hope to continue pushing new PRs as I continue to discover more and more about this project.

EDIT: this ticket is more or less about either documenting this facility in a prominent place, or making one of these modes the default and documenting how to disable or change it.

Rudd-O avatar Apr 19 '23 13:04 Rudd-O

Offloading already works in this repo. With deepspeed, flexgen, and accelerate.

Ph0rk0z avatar Apr 19 '23 15:04 Ph0rk0z
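For reference, a minimal sketch of what this kind of offloading looks like when driven through the public Accelerate API directly (the model path and memory limits below are placeholder values, not taken from this thread):

# Build the model skeleton without allocating weight memory, then let
# Accelerate place each layer on GPU, CPU RAM, or disk as it streams the
# checkpoint in. Paths and memory limits are placeholders.
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("models/my-llama-13b")   # placeholder path

with init_empty_weights():                                   # no RAM used for weights yet
    model = AutoModelForCausalLM.from_config(config)

device_map = infer_auto_device_map(
    model,
    max_memory={0: "22GiB", "cpu": "28GiB"},                 # example limits
    no_split_module_classes=["LlamaDecoderLayer"],           # keep decoder blocks intact
)

model = load_checkpoint_and_dispatch(
    model,
    checkpoint="models/my-llama-13b",                        # checkpoint folder
    device_map=device_map,
    offload_folder="offload",                                # layers that fit nowhere else go here
)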

Offloading already works in this repo. With deepspeed, flexgen, and accelerate.

Does accelerate (--auto-devices) use lazy loading for big models which are larger than RAM, like llama.cpp does with mmap? I've tried to use it, but server.py exits, probably because of OOM. No errors, btw.

mrdc avatar Apr 19 '23 18:04 mrdc

It probably depends on how you sharded the model too. If it's in 3 chunks, yeah, it might be a problem.

Ph0rk0z avatar Apr 19 '23 18:04 Ph0rk0z

It probably depends on how you sharded the model too.

Oh, it's clear now: my model is not sharded at all. I thought that --auto-devices or deepspeed loads models using some "magic" under the hood.

BTW, I'm using quantized models, which are not sharded by default. Does that mean it's a no-go with deepspeed and accelerate?

mrdc avatar Apr 19 '23 18:04 mrdc
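For context on sharding: a non-quantized checkpoint saved as one huge file can be re-saved in smaller shards with transformers, so loaders can stream it piece by piece instead of reading it all at once. A minimal sketch (paths and shard size are placeholders; this does not help with single-file GPTQ safetensors):

# Re-save a single-file checkpoint as ~2 GB shards. Placeholder paths.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "models/my-llama-13b",       # original, unsharded checkpoint
    low_cpu_mem_usage=True,      # avoid materializing a second full copy in RAM
)
model.save_pretrained(
    "models/my-llama-13b-sharded",
    max_shard_size="2GB",        # write many small files instead of one big one
)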

Quantized will load in accelerate but not deepspeed.

Ph0rk0z avatar Apr 20 '23 12:04 Ph0rk0z

Quantized will load in accelerate

Ok, I'll give it one more try. Somehow accelerate suddenly stopped working - I don't see the regular message where it says how much VRAM/RAM accelerate assigned.

mrdc avatar Apr 20 '23 15:04 mrdc

python server.py --auto-devices --share --wbits 4 --groupsize 128 --model_type llama --listen --model xxxxxx-safetensor

Trying to load a model which is > RAM, and server.py exits because of OOM.

mrdc avatar Apr 20 '23 15:04 mrdc
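One thing worth trying with the accelerate loader is capping VRAM and RAM explicitly so the remaining layers get offloaded to disk; something along these lines, assuming the --gpu-memory / --cpu-memory / --disk flags in this repo (values are examples, and this may not apply to the GPTQ path):

python server.py --auto-devices --gpu-memory 22 --cpu-memory 28 --disk --model xxxxxx-safetensor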

If even the quantized model is larger than your RAM, you will have to make a big swap file and load it that way. You can also use pre-layer for llama.

Ph0rk0z avatar Apr 21 '23 11:04 Ph0rk0z
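For reference, a sketch of both suggestions; the swap size and layer count are arbitrary examples, and the exact pre-layer flag spelling should be checked against server.py --help:

# Create and enable a 64 GB swap file (example size, Linux, run as root):
sudo fallocate -l 64G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Or keep only the first N quantized llama layers on the GPU:
python server.py --wbits 4 --groupsize 128 --model_type llama --pre_layer 30 --model xxxxxx-safetensor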

Note: at least on my machine, DeepSpeed is no faster than --auto-devices with or without --8bits.

Rudd-O avatar Apr 22 '23 14:04 Rudd-O

I'm seeing the same behavior. Big safetensor file (MetalX/GPT4-X-Alpaca-30B-4bit), the same relevant settings as in @mrdc's comment above, and a crash with no error message in model.load_state_dict. I have 32GB of CPU RAM, 24GB of VRAM, and a big 128GB swap file. It should work; not sure why it doesn't...

HWiese1980 avatar Apr 23 '23 10:04 HWiese1980

Weirdly, if using 0cc4m's KoboldAI latestgptq branch (https://github.com/0cc4m/KoboldAI), I can load a model without running out of RAM or paging massive amounts to swap. But if I try to load the same file in Ooba, it errors out after paging out 140GB of swap.

This is using the same conda environment and manually running either one.

The model I'm trying to load is a safetensor. It doesn't matter if it's the 30B GPT4-X or the newest openassist 30B.

Oh, and I'm using 0cc4m's GPTQ branch in both. (I have also tested with Ooba's GPTQ branch, and it doesn't make a difference.)

Weirdly, there is a spike of swap usage before line 47 (model = model.eval()) in GPTQ_Loader.py; however, nothing appears to actually be written to swap, as my SSD doesn't start writing until the model reaches the following "loading model..." step.

I'm not good enough to figure out what the difference is. But there is proof that large models can be loaded without blowing out swap.

askmyteapot avatar Apr 23 '23 15:04 askmyteapot

Kobold uses lazy loading.

Ph0rk0z avatar Apr 23 '23 16:04 Ph0rk0z
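For anyone wondering what the lazy load does differently, here is a minimal sketch of the idea using the safetensors API (the file path is a placeholder; real loaders assign each tensor to the matching module parameter as they go):

# Memory-map the checkpoint and read tensors one at a time, so the whole
# state dict never has to sit in RAM at once. Placeholder path.
from safetensors import safe_open

with safe_open("models/my-model-4bit.safetensors", framework="pt", device="cpu") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)   # only this one tensor is materialized
        # ... copy it into the model (or straight to the GPU), then free it ...
        del tensor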

Could we have an mmap option? On my laptop there is a 13B model that llama.cpp can run, but ooba cannot, because I run out of RAM (not VRAM) and get OOM'd.

alkeryn avatar May 12 '23 22:05 alkeryn

I thought that was a function of the llama.cpp Python module.

Ph0rk0z avatar May 13 '23 12:05 Ph0rk0z

@Ph0rk0z Yes, it is, but it would be nice to be able to use it in the non-cpp version as well. It should be possible to copy the model directly to VRAM without first having to load it fully into RAM. It kinda sucks for low-RAM, high-VRAM systems.

alkeryn avatar May 13 '23 13:05 alkeryn

I assume the person who wrote the wrapper has to update it. I actually want to turn mmap off and see if I get faster speeds loading the whole thing into RAM.

Ph0rk0z avatar May 13 '23 14:05 Ph0rk0z
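For reference, the mmap toggle as exposed by the llama-cpp-python wrapper, assuming its Llama constructor (the model path is a placeholder):

# use_mmap=True (the default) maps the file and lets the OS page weights in on
# demand; use_mmap=False reads the whole model into RAM up front. Placeholder path.
from llama_cpp import Llama

llm = Llama(model_path="models/my-13b-q4.bin", n_ctx=2048, use_mmap=False)
print(llm("Hello", max_tokens=16)["choices"][0]["text"])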

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.

github-actions[bot] avatar Aug 30 '23 23:08 github-actions[bot]