How to Run Concurrent Inferences?
Hi,
Congrats on the launch of Ovis2. It is amazing!
Question, please: could you let me know how to run concurrent inferences?
I am not talking about batch processing, where several requests are sent together in a single call.
What I am looking for is how to handle a new request while another one is still being processed, without waiting for the first one to finish:
- Request A is being processed
- Request B is received shortly after A and immediately starts being processed while A is still in progress (no waiting; see the sketch after this list)
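
For illustration, here is roughly the timing I have in mind. This is only a minimal asyncio sketch; `generate()` below is a placeholder standing in for the real Ovis2 inference call, not its actual API:

```python
import asyncio
import time


# Placeholder coroutine standing in for one Ovis2 inference call (not the real API).
async def generate(request_id: str, duration: float) -> str:
    print(f"{time.strftime('%X')} start {request_id}")
    await asyncio.sleep(duration)  # stands in for token generation
    print(f"{time.strftime('%X')} done  {request_id}")
    return f"response for {request_id}"


async def main() -> None:
    task_a = asyncio.create_task(generate("A", 5.0))
    await asyncio.sleep(0.5)  # request B arrives shortly after A
    task_b = asyncio.create_task(generate("B", 5.0))
    # If both run concurrently, total wall time is ~5.5 s instead of ~10 s.
    print(await asyncio.gather(task_a, task_b))


asyncio.run(main())
```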
Please let me know the best approach to achieve this and to maximize efficiency.
Appreciate it!
Thanks!
I tested batch inference as well and did not see any performance improvement. It took approximately the same time as sequential inference, if not slightly longer. I am using the 8B model, by the way.
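
For reference, this is roughly what I mean by batch inference. It is a generic transformers sketch with a placeholder text-only model; the actual Ovis2 call uses its own multimodal interface:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Generic batched-generation sketch; "gpt2" is only a stand-in, not Ovis2's API.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # left-pad for decoder-only generation
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompts = ["First prompt goes here.", "Second prompt goes here."]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

with torch.inference_mode():
    # One forward pass over the whole batch instead of one call per prompt.
    outputs = model.generate(
        **inputs, max_new_tokens=64, pad_token_id=tokenizer.eos_token_id
    )

print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```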
I tested on an RTX 5000 Ada, which, according to GPT, should not be the bottleneck.
It was generating around 30 tokens/sec, which is slow.
Any advice on how to improve performance without loading several model instances (which is not memory-efficient)?
Also, does Ovis2 work with vLLM?
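
If it does, the kind of setup I am hoping for is an OpenAI-compatible vLLM server with continuous batching, so overlapping requests are interleaved instead of queued. This is only a sketch under that assumption; the serve command and model id below are guesses on my part:

```python
# Assumed server command (not confirmed for Ovis2):
#   vllm serve AIDC-AI/Ovis2-8B --port 8000
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="AIDC-AI/Ovis2-8B",  # assumed model id
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


async def main() -> None:
    # Two requests in flight at the same time; the server interleaves them.
    print(await asyncio.gather(ask("First request"), ask("Second request")))


asyncio.run(main())
```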
Thanks!