Is there any way to support multiple parallel generation requests to the same model?
I'm kind of a newbie and this is probably not the right place to ask, but maybe I can get pointed in the right direction. I have a FastAPI server and I would love to be able to handle multiple generation requests at the same time.
I don't know if this is something that can be done at the library level or if it requires a specific architecture; either way, I'm trying to figure it out and maybe get some help here.
At the moment I only have a queue of requests so that I can satisfy them one by one as soon as possible.
You can process in batches if you have enough VRAM to allocate the cache with a larger batch size. There's an example of how this works in example_batch.py, using generate_simple. It's not thoroughly tested and optimized, but it should still be substantially faster to generate in batches, except in edge cases where one prompt is very short and another is very long etc.
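For reference, here's a minimal sketch of that batched path, in the spirit of example_batch.py. The paths below are placeholders; the key points are allocating the cache with a batch_size that covers all the prompts and passing the prompts to generate_simple as a list:

```python
# Minimal sketch of batched generation, loosely following example_batch.py.
# Paths are placeholders; point them at your own model files.
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

prompts = [
    "Once upon a time,",
    "I don't like to",
    "A turbo encabulator is a",
]

config = ExLlamaConfig("/path/to/config.json")        # model config
config.model_path = "/path/to/model.safetensors"      # quantized weights

model = ExLlama(config)
tokenizer = ExLlamaTokenizer("/path/to/tokenizer.model")

# The cache has to be allocated for the full batch up front.
cache = ExLlamaCache(model, batch_size = len(prompts))
generator = ExLlamaGenerator(model, tokenizer, cache)

# A list of prompts is generated in one batched pass.
outputs = generator.generate_simple(prompts, max_new_tokens = 200)
for text in outputs:
    print("---")
    print(text)
```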
I've seen this example, but what if I don't have all the prompts at the same time? For example, let's say I receive a request and start generating the output, and while the model is generating I get a second request. What do I do then?
Well, the implementation isn't threadsafe, and you wouldn't want two threads both trying to put a 100% load on the GPU anyway. Batching is great, though, because generating two replies simultaneously isn't that much slower than generating one reply. It does require batches to be processed in parallel, though.
You could have multiple generators and just call them in turn, generating one token at a time from each, with the approach from generate_simple. That should work, although of course much slower than using batches.
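As a rough sketch of that round-robin approach, something like the following could work. It reuses the model and tokenizer from the sketch above and assumes the generator's gen_begin() / gen_single_token() helpers (the ones generate_simple uses internally); check generator.py for the exact signatures before relying on this:

```python
# Rough sketch: multiplexing two requests over one model by giving each its
# own generator/cache and advancing them one token at a time in turn.
# Assumes gen_begin() and gen_single_token() behave as in generator.py.
requests = ["First prompt ...", "Second prompt ..."]

# One generator (with its own batch-size-1 cache) per in-flight request.
generators = [ExLlamaGenerator(model, tokenizer, ExLlamaCache(model)) for _ in requests]

# Prefill each sequence with its prompt.
for gen, prompt in zip(generators, requests):
    gen.gen_begin(tokenizer.encode(prompt))

# Round-robin: one new token per request per pass.
for _ in range(200):
    for gen in generators:
        gen.gen_single_token()

# Each generator's sequence now holds its prompt plus the generated tokens.
for gen in generators:
    print(tokenizer.decode(gen.sequence[0]))
```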
Out of curiosity, what would be more efficient: batch processing a bunch of prompts, or running multiple instances of exllama? It seems like batch processing would be largely better, instead of using the VRAM to load the model multiple times. But then again, at a certain point it might get slower per prompt and take up too much VRAM, so it's hard to say. I guess it depends on hardware and on finding the balance between speed and VRAM usage for multiple instances vs. batching. I'll do some testing.
Batch processing is always going to be way faster and use less VRAM than running multiple instances of the model, or running the same model on multiple sequences in a multiplexed fashion. At least in terms of throughput. It's harder to optimize for latency when the throughput is low, of course.
@turboderp Did some basic testing on a single 3090 with a 13B model, iterating over the prompts in unbatched mode:
| Prompts | Un-batched Time (s) | Batched Time (s) |
|---|---|---|
| 4 | 13.6 | 9.1 |
| 8 | 26 | 16.8 |
| 10 | 32 | 17.1 |
Note that at batch size 10, max VRAM usage for the batched run was 24GB, while the un-batched run only used 10GB. So isn't it true that running two unbatched instances in parallel on a single 3090 with the 13B model would give similar or better throughput, and better latency, than running a single batched instance?
- 10 prompts / 17.1 seconds = 0.58 prompts/second (batched)
- 10 prompts / 32 seconds = 0.31 prompts/second (single instance unbatched)
- 2 * 0.31 prompts/second = 0.62 prompts/second (two instances running at same time)
Is this expected? Or am I doing something wrong?
prompts = [
"Once upon a time," * 10,
"I don't like to" * 10,
"A turbo encabulator is a" * 10,
"In the words of Mark Twain," * 10,
"Once upon a time," * 10,
"I don't like to" * 10,
"A turbo encabulator is a" * 10,
"In the words of Mark Twain," * 10,
"Once upon a time," * 10,
"I don't like to" * 10,
]
# Generate, un-batched
import time
start_time = time.time()
for prompt in prompts:
    output = generator.generate_simple(prompt, max_new_tokens=200)
    print("---")
    print(output)
end_time = time.time()
time_taken = end_time - start_time
print(f"Time taken to generate response for NON BATCHED MODE: {time_taken} seconds")
import subprocess
command = "nvidia-smi"
result = subprocess.run(command, shell=True, capture_output=True, text=True)
print(result.stdout)
# Generate, batched
import time
start_time_batch = time.time()
output = generator.generate_simple(prompts, max_new_tokens=200)
end_time_batch = time.time()
time_taken = end_time_batch - start_time_batch
for line in output:
    print("---")
    print(line)
print(f"Time taken to generate responses in BATCH MODE: {time_taken} seconds")
import subprocess
command = "nvidia-smi"
result = subprocess.run(command, shell=True, capture_output=True, text=True)
print(result.stdout)
> 2 * 0.31 prompts/second = 0.62 prompts/second (two instances running at same time)
But how would you get double the speed when running two instances at the same time? With a second GPU?
On a side note, you can try tweaking some of the tuning parameters for potential performance gains. For this particular example I go from 12.2s to 9.0s on the 10-batch by setting:
config.matmul_recons_thd = 12
config.fused_mlp_thd = 12
config.sdp_thd = 12
YMMV of course. Depends on the GPU, CPU, etc.
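For reference, those are plain attributes on ExLlamaConfig, set after creating the config and before building the model, as in the repo's example scripts (paths below are placeholders):

```python
# Applying the tuning thresholds from above; optimal values depend on GPU/CPU.
from model import ExLlama, ExLlamaConfig

config = ExLlamaConfig("/path/to/config.json")        # placeholder path
config.model_path = "/path/to/model.safetensors"      # placeholder path

config.matmul_recons_thd = 12
config.fused_mlp_thd = 12
config.sdp_thd = 12

model = ExLlama(config)                               # model runs with the tuned config
```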
> But how would you get double the speed when running two instances at the same time? With a second GPU?
Running 2x instances of exllama on the same GPU. So instead of one instance processing 10 prompts with 24GB of VRAM, we could run two instances for the same VRAM, with similar throughput and lower latency, right? I'll look into the params.
If you try to run two at the same time they will compete for CUDA cores and memory bandwidth. Best case scenario, they'll both be running at half speed, but more likely there will be some scheduling hiccups. And they can't both make full use of the L2 cache at the same time, so overall it will be quite a bit slower than just using one instance to alternate between two generators.
That's fair, my testing was kind of flawed since I was only running the two instances one at a time. I can see your scenario playing out under load when both are generating at the same time; batching is definitely better, as you mentioned.