New streaming method (much faster)

Open oobabooga opened this issue 1 year ago • 7 comments

It uses a trick to create an iterator from stopping_criteria.

This way, it is not necessary to call model.generate multiple times, making things a lot (really, a lot) faster.
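
In rough terms, the trick looks like this (a minimal sketch of the idea, not the actual code in this PR; `generate_stream` and its arguments are illustrative):

```python
# Sketch only: a StoppingCriteria callback sees the partial sequence at every
# generation step, so it can push tokens into a queue while model.generate runs
# once in a background thread, and the consumer reads from the queue lazily.
import queue
import threading

import transformers


class Stream(transformers.StoppingCriteria):
    def __init__(self, callback):
        self.callback = callback

    def __call__(self, input_ids, scores, **kwargs) -> bool:
        self.callback(input_ids[0])  # report all tokens generated so far
        return False                 # never ask generate() to stop


def generate_stream(model, tokenizer, prompt, **generate_kwargs):
    """Yield the decoded output after every new token (hypothetical helper)."""
    q = queue.Queue()
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

    def run():
        model.generate(
            input_ids,
            stopping_criteria=transformers.StoppingCriteriaList([Stream(q.put)]),
            **generate_kwargs,
        )
        q.put(None)  # sentinel: generation finished

    threading.Thread(target=run, daemon=True).start()
    while (ids := q.get()) is not None:
        yield tokenizer.decode(ids, skip_special_tokens=True)
```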

Currently broken in chat mode.

oobabooga avatar Mar 08 '23 05:03 oobabooga

It's mostly working, but with

  • chat mode --cai-chat
  • stop generating at new line character? unset

there is a memory leak that causes VRAM usage to skyrocket.

oobabooga avatar Mar 09 '23 00:03 oobabooga

Suggestion: when running inference with LLaMA 13B on this branch, I've encountered an OOM issue when running the command `python server.py --cai-chat --auto-devices --gpu-memory "3"`, which never occurred on the main branch. Maybe adding a function that detects GPU memory usage and "flushes" VRAM when needed could prevent this from happening?
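
A rough sketch of what such a check could look like (this helper is hypothetical, not something that exists in the repo): torch exposes per-device memory counters that could be polled before each generation.

```python
# Hypothetical helper: flush the cache when reserved VRAM gets close to capacity.
import gc

import torch


def maybe_flush_vram(device: int = 0, threshold: float = 0.9) -> None:
    total = torch.cuda.get_device_properties(device).total_memory
    reserved = torch.cuda.memory_reserved(device)
    if reserved / total > threshold:
        gc.collect()
        torch.cuda.empty_cache()
```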

Silver267 avatar Mar 10 '23 05:03 Silver267

> It's mostly working, but with
>
> * chat mode `--cai-chat`
> * `stop generating at new line character?` unset
>
> there is a memory leak that causes VRAM usage to skyrocket.

I'm getting a memory leak even without `--cai-chat`: just generating and stopping with LLaMA 13B 4-bit on an RTX 3060 12GB takes VRAM usage from 8 GiB to OOM after 5-6 generations. Usage jumps by several hundred MiB on each run.

David-337 avatar Mar 10 '23 12:03 David-337

> Maybe adding a function that detects GPU memory usage and "flushes" VRAM when needed could prevent this from happening?

Such a function already exists in text_generation.py:

def clear_torch_cache():
    # run Python garbage collection, then release unused cached CUDA memory
    gc.collect()
    if not shared.args.cpu:
        torch.cuda.empty_cache()

I have tried adding calls to clear_torch_cache() everywhere inside generate_reply() and the memory still leaks...

oobabooga avatar Mar 10 '23 12:03 oobabooga

~~Update: Waiting about 20 seconds between runs seems to prevent/reduce the chance of the VRAM usage skyrocketing. Perhaps it takes time to clear the cache after generating, and running another generation interrupts that?~~

Edit: Never mind, it looks like I jumped the gun on this. VRAM usage is still skyrocketing; I'm not sure whether it jumps less with the 20-second pauses.

David-337 avatar Mar 10 '23 13:03 David-337

I have made some progress: the memory leak seems to disappear if

use_cache=False

is added as one of the parameters for model.generate(...).
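
For reference, a minimal sketch of the call with that flag (the other arguments here are illustrative, not the project's actual settings):

```python
# Illustrative call, not the project's real generate() invocation:
# use_cache=False stops transformers from keeping past key/value tensors between steps.
output_ids = model.generate(
    input_ids,
    max_new_tokens=200,
    do_sample=True,
    use_cache=False,
)
```

Without the cache, attention over the full sequence is recomputed at every step, which is consistent with a drop in it/s.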

The cost is that the generation speed went from 30 it/s to 20 it/s in a deterministic test.

What this means or how to fix it, I don't know.

oobabooga avatar Mar 11 '23 01:03 oobabooga

This is so annoyinggggggggggggggggggggggggg

oobabooga avatar Mar 11 '23 04:03 oobabooga

Tested the latest commit; it seems that VRAM usage is no longer skyrocketing. The streaming is reasonably fast.

David-337 avatar Mar 11 '23 22:03 David-337

lgtm

Silver267 avatar Mar 11 '23 22:03 Silver267

@David-337 the memory leak was gone, but performance was much slower, even slower than the current streaming implementation. I have reverted everything back to the latest commit that was still fast.

oobabooga avatar Mar 11 '23 23:03 oobabooga

Iteratorize is taken from here: https://stackoverflow.com/a/9969000

In another comment in the same thread, it is said that

> note that, unless all items are eventually retrieved by this generator, the created thread will deadlock (it will block forever and its resources will never be released). The producer is waiting on the queue, and since it stores a reference to that queue, it will never be reclaimed by the gc even if the consumer is. The queue will then become unreachable, so nobody will be able to release the lock.

A clean solution for that is unknown

I guess that's the problem. If the iterator generated by Iteratorize doesn't reach its end, memory is lost forever, in our case inside the GPU.
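
A sketch of one way this could be handled (an assumption about a possible fix, not the actual Iteratorize in this repo): give the consumer a way to tell the producer thread to abort, so that an abandoned iterator unwinds model.generate instead of blocking forever.

```python
# Assumed sketch, not the repository's implementation: a stoppable Iteratorize.
import threading
from queue import Queue


class Iteratorize:
    """Wrap func(callback) as an iterator that can abort its producer thread."""

    def __init__(self, func):
        self.q = Queue()              # unbounded, so the producer never blocks on put()
        self.sentinel = object()
        self.stop_now = False

        def callback(value):
            if self.stop_now:
                raise RuntimeError    # unwinds model.generate inside the producer thread
            self.q.put(value)

        def task():
            try:
                func(callback)
            except RuntimeError:
                pass                  # consumer went away; let the thread exit cleanly
            finally:
                self.q.put(self.sentinel)

        threading.Thread(target=task, daemon=True).start()

    def __iter__(self):
        return self

    def __next__(self):
        item = self.q.get()
        if item is self.sentinel:
            raise StopIteration
        return item

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.stop_now = True          # next callback call aborts the producer thread
```

Used as a context manager, breaking out of the loop early sets the flag, the next callback raises inside the producer thread, and whatever the aborted model.generate call was holding becomes collectable.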

oobabooga avatar Mar 11 '23 23:03 oobabooga

The fix works, but it seems to decrease generation speed when offloading to RAM? I was previously getting 0.4+ tokens/s running LLaMA 13B with `--gpu-memory "3"`, and now I'm getting 0.33 tokens/s. The decrease in generation speed is not caused by context length.

Silver267 avatar Mar 12 '23 04:03 Silver267

Can you update and try again? The previous fix wasn't really a fix in the end, but this third attempt seems to be robust. 0bd5430

oobabooga avatar Mar 12 '23 05:03 oobabooga

Seems like performance improved compared to the last branch, lgtm.

Silver267 avatar Mar 12 '23 05:03 Silver267

This seems to work and make sense now.

YOLO moment then, let's merge.

oobabooga avatar Mar 12 '23 06:03 oobabooga