How to Run Concurrent Inferences?
Hi,
Congrats on the launch of Ovis2. It is amazing!
Question, please: could you let me know how to run concurrent inferences?
I am not talking about batch processing, where several requests are sent together in a single call.
What I am looking for is how to handle a new request while another one is still being processed, without waiting for the first one to finish:
- Request A is being processed
- Request B is received shortly after A and immediately starts being processed while A is still in progress (no waiting; see the sketch after this list)
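
For illustration, here is roughly the timing I have in mind. This is only a minimal asyncio sketch; `generate()` below is a placeholder standing in for the real Ovis2 inference call, not its actual API:

```python
import asyncio
import time


# Placeholder coroutine standing in for one Ovis2 inference call (not the real API).
async def generate(request_id: str, duration: float) -> str:
    print(f"{time.strftime('%X')} start {request_id}")
    await asyncio.sleep(duration)  # stands in for token generation
    print(f"{time.strftime('%X')} done  {request_id}")
    return f"response for {request_id}"


async def main() -> None:
    task_a = asyncio.create_task(generate("A", 5.0))
    await asyncio.sleep(0.5)  # request B arrives shortly after A
    task_b = asyncio.create_task(generate("B", 5.0))
    # If both run concurrently, total wall time is ~5.5 s instead of ~10 s.
    print(await asyncio.gather(task_a, task_b))


asyncio.run(main())
```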
Please let me know the best approach to achieve this and to maximize efficiency.
Appreciate it!
Thanks!
I tested batch inference as well and did not see any performance improvement. It took approximately the same time as sequential inference, if not slightly longer. I am using the 8B model, by the way.
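
For reference, this is roughly what I mean by batch inference. It is a generic transformers sketch with a placeholder text-only model; the actual Ovis2 call uses its own multimodal interface:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Generic batched-generation sketch; "gpt2" is only a stand-in, not Ovis2's API.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # left-pad for decoder-only generation
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompts = ["First prompt goes here.", "Second prompt goes here."]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

with torch.inference_mode():
    # One forward pass over the whole batch instead of one call per prompt.
    outputs = model.generate(
        **inputs, max_new_tokens=64, pad_token_id=tokenizer.eos_token_id
    )

print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```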
I tested on an RTX 5000 Ada, which, according to GPT, should not be the bottleneck.
It was generating around 30 tokens/sec, which is slow.
Any advice on how to improve performance without loading several model instances (which is not memory-efficient)?
Also, does Ovis2 work with vLLM?
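
If it does, the kind of setup I am hoping for is an OpenAI-compatible vLLM server with continuous batching, so overlapping requests are interleaved instead of queued. This is only a sketch under that assumption; the serve command and model id below are guesses on my part:

```python
# Assumed server command (not confirmed for Ovis2):
#   vllm serve AIDC-AI/Ovis2-8B --port 8000
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="AIDC-AI/Ovis2-8B",  # assumed model id
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


async def main() -> None:
    # Two requests in flight at the same time; the server interleaves them.
    print(await asyncio.gather(ask("First request"), ask("Second request")))


asyncio.run(main())
```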
Thanks!