
Question about example_flask.py

Open ZeroYuJie opened this issue 2 years ago • 1 comments

I found the example of using Flask to serve API requests. I gave it a try, but when making concurrent requests, the generated responses come back as garbled text. I suspect this is caused by two questions being inferred at the same time. Is it possible to generate answers concurrently?

ZeroYuJie avatar Aug 08 '23 12:08 ZeroYuJie

There's no support for concurrency, no. You'd need a separate instance for each thread, with its own generator and cache, and some mechanism for sensibly splitting the work between threads, given that the implementation completely occupies the GPU.
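Since the generator has no concurrency support, the simplest workaround is to serialize requests so only one inference runs at a time. A minimal sketch, where `generate_reply` is a hypothetical stand-in for the actual generator call in `example_flask.py`:

```python
import threading

# A single shared generator/cache can only serve one request at a
# time, so concurrent Flask handlers must serialize access to it.
generate_lock = threading.Lock()

def generate_reply(prompt):
    # Hypothetical placeholder for the real ExLlama generator call.
    return f"echo: {prompt}"

def handle_request(prompt):
    # Only one thread runs the model at a time; other requests
    # block here instead of corrupting the shared cache.
    with generate_lock:
        return generate_reply(prompt)
```

Requests then queue up behind the lock rather than interleaving, at the cost of losing parallelism entirely.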

You could possibly have a streaming API that dispatches to multiple generators when there are concurrent requests, but you'd need a lot of VRAM to accommodate that.
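One way to sketch that dispatch idea: a fixed pool of worker threads, each owning its own (hypothetical) generator instance and cache, pulling requests from a shared queue. Everything below is an illustrative stand-in, not the library's API, and in practice each instance would need its own slice of VRAM:

```python
import queue
import threading

NUM_WORKERS = 2  # number of independent generator instances (VRAM permitting)

def make_generator(worker_id):
    # Stand-in for constructing a separate ExLlama generator + cache
    # per worker; here it just tags replies with the worker id.
    def generate(prompt):
        return f"worker{worker_id}: {prompt}"
    return generate

requests_q = queue.Queue()

def worker(worker_id):
    gen = make_generator(worker_id)
    while True:
        prompt, reply_q = requests_q.get()
        if prompt is None:  # shutdown sentinel
            break
        reply_q.put(gen(prompt))

def submit(prompt):
    # Enqueue a request and block until some worker answers it.
    reply_q = queue.Queue(maxsize=1)
    requests_q.put((prompt, reply_q))
    return reply_q.get()

workers = [threading.Thread(target=worker, args=(i,), daemon=True)
           for i in range(NUM_WORKERS)]
for t in workers:
    t.start()
```

A Flask handler would then just call `submit(prompt)`, and up to `NUM_WORKERS` requests could be generated concurrently, each on its own instance.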

turboderp avatar Aug 08 '23 15:08 turboderp