text-generation-webui
New streaming method (much faster)
It uses a trick to create an iterator from `stopping_criteria`. This way, it is not necessary to call `model.generate` multiple times, making things a lot (really, a lot) faster.
Currently broken in chat mode.
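Roughly, the trick looks like this (a minimal sketch, not the actual implementation): `transformers.StoppingCriteria` hooks are invoked once per generated token, so one of them can double as a per-token callback inside a single `model.generate()` call.

```python
import transformers


class Stream(transformers.StoppingCriteria):
    """Sketch: use the per-token stopping_criteria hook as a streaming
    callback, so a single model.generate() call can report every new
    token instead of generate() being re-run for each one."""

    def __init__(self, callback):
        self.callback = callback

    def __call__(self, input_ids, scores, **kwargs):
        # generate() calls this after each newly sampled token.
        self.callback(input_ids[0, -1].item())
        return False  # never request a stop from here


# Usage sketch (model and input_ids assumed to exist already):
# model.generate(
#     input_ids,
#     max_new_tokens=200,
#     stopping_criteria=transformers.StoppingCriteriaList([Stream(print)]),
# )
```

The `Iteratorize` wrapper discussed further down then turns that callback into a regular iterator.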
It's mostly working, but with

- chat mode (`--cai-chat`)
- `stop generating at new line character?` unset

there is a memory leak that causes VRAM usage to skyrocket.
Suggestion: when running inference with LLaMA 13B on this branch, I've encountered an OOM issue when running the command `python server.py --cai-chat --auto-devices --gpu-memory "3"`, which never occurred on the main branch. Maybe adding a function that detects GPU memory usage and "flushes" VRAM when needed could prevent such a thing from happening?
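A sketch of what such a detection-and-flush helper could look like (the function name and threshold are made up for illustration; it relies only on the standard `torch.cuda` memory-stats API):

```python
import gc

import torch


def maybe_flush_vram(threshold=0.9):
    """Sketch: empty the CUDA cache whenever reserved memory gets close
    to the device's total capacity. Name and threshold are illustrative."""
    if not torch.cuda.is_available():
        return
    device = torch.cuda.current_device()
    total = torch.cuda.get_device_properties(device).total_memory
    reserved = torch.cuda.memory_reserved(device)
    if reserved / total > threshold:
        gc.collect()
        torch.cuda.empty_cache()
```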
> It's mostly working, but with
>
> - chat mode (`--cai-chat`)
> - `stop generating at new line character?` unset
>
> there is a memory leak that causes VRAM usage to skyrocket.
I'm getting a memory leak even without `--cai-chat`: just generating and stopping with LLaMA 13B 4-bit on an RTX 3060 12GB takes VRAM usage from 8 GiB to OOM after 5-6 generations, jumping several hundred MiB each run.
> Maybe adding a function that detects GPU memory usage and "flushes" VRAM when needed could prevent such a thing from happening?
Such a function already exists in `text_generation.py`:
```python
def clear_torch_cache():
    gc.collect()
    if not shared.args.cpu:
        torch.cuda.empty_cache()
```
I have tried adding calls to `clear_torch_cache()` everywhere inside `generate_reply()` and the memory still leaks...
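For illustration, the kind of wrapping that was tried amounts to something like this (a sketch, not the actual `generate_reply()` code):

```python
def generate_with_cache_clearing(generate_fn, *args, **kwargs):
    # Sketch: clear the CUDA cache before and after a generation call.
    # As noted above, this did not stop the VRAM growth.
    clear_torch_cache()
    try:
        return generate_fn(*args, **kwargs)
    finally:
        clear_torch_cache()
```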
~~Update: waiting about 20 seconds between runs seems to prevent/reduce the chance of VRAM usage skyrocketing. Perhaps it takes time to clear the cache after generating, and starting a new generation interrupts that?~~
Edit: never mind, it looks like I jumped the gun on this. VRAM usage is still skyrocketing, and I'm not sure whether it jumps less with the 20-second pauses.
I have made some progress: the memory leak seems to disappear if `use_cache=False` is added as one of the parameters to `model.generate(...)`.
The cost is that the generation speed went from 30 it/s to 20 it/s in a deterministic test.
What this means or how to fix it, I don't know.
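For context, the workaround amounts to something like the following (a sketch; the other arguments are placeholders). Presumably `use_cache=False` disables the key/value cache, so past attention states are recomputed at every step instead of being kept on the GPU, which would explain both the slowdown and the reduced memory pressure.

```python
output = model.generate(
    input_ids,
    max_new_tokens=200,                  # placeholder settings
    stopping_criteria=stopping_criteria,
    use_cache=False,                     # no KV cache: slower, but the leak disappears
)
```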
This is so annoyinggggggggggggggggggggggggg
Tested the latest commit; it seems that VRAM usage is no longer skyrocketing. The streaming is reasonably fast.
lgtm
@David-337 the memory leak was gone but the performance was much slower, even slower than the current streaming implementation. I have reset all changes to the latest commit that was still fast.
`Iteratorize` is taken from here: https://stackoverflow.com/a/9969000
In another comment in the same thread, it is said that:

> note that, unless all items are eventually retrieved by this generator, the created thread will deadlock (it will block forever and its resources will never be released). The producer is waiting on the queue, and since it stores a reference to that queue, it will never be reclaimed by the gc even if the consumer is. The queue will then become unreachable, so nobody will be able to release the lock.

A clean solution for that is unknown.
I guess that's the problem. If the iterator generated by `Iteratorize` doesn't reach its end, memory is lost forever, in our case inside the GPU.
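One way to avoid that deadlock (a sketch, not necessarily what the actual fix in this PR does) is to give the wrapper a kill switch: closing the iterator makes the next callback invocation raise, so the producer thread unwinds and drops its references instead of blocking on the queue forever.

```python
import threading
from queue import Empty, Queue


class ValueStopped(Exception):
    """Raised inside the producer thread once the consumer has gone away."""


class IteratorizeSketch:
    """Sketch of an Iteratorize-style wrapper: runs func(callback=...) in a
    background thread and exposes the callback values as an iterator.
    Calling close() (e.g. from a finally block) prevents the deadlock
    described above."""

    def __init__(self, func, *args, **kwargs):
        self.queue = Queue(maxsize=1)
        self.sentinel = object()
        self.stop_now = False

        def callback(value):
            if self.stop_now:
                raise ValueStopped  # unwind the producer thread
            self.queue.put(value)   # blocks until the consumer reads it

        def producer():
            try:
                func(*args, callback=callback, **kwargs)
            except ValueStopped:
                pass
            if not self.stop_now:
                self.queue.put(self.sentinel)  # normal end-of-stream marker

        threading.Thread(target=producer, daemon=True).start()

    def __iter__(self):
        return self

    def __next__(self):
        item = self.queue.get()
        if item is self.sentinel:
            raise StopIteration
        return item

    def close(self):
        # Tell the producer to bail out, then unblock any pending put().
        self.stop_now = True
        try:
            self.queue.get_nowait()
        except Empty:
            pass


# Usage sketch:
# tokens = IteratorizeSketch(stream_generate_fn)
# try:
#     for token in tokens:
#         ...
# finally:
#     tokens.close()  # guarantees the producer thread can terminate
```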
The fix works, but it seems to decrease the generation speed when offloading to RAM. I was previously getting 0.4+ tokens/s when running LLaMA 13B with `--gpu-memory "3"`, and now I'm getting 0.33 tokens/s. The decrease in generation speed is not caused by context length.
Can you update and try again? The previous fix wasn't really a fix in the end, but this third attempt seems to be robust. 0bd5430
Seems like performance improved from the last revision of the branch, lgtm.
This seems to work and make sense now.
YOLO moment then, let's merge.