
Does Cog support batched inference?

Open akshhack opened this issue 2 years ago • 9 comments

How do we scale requests when using Cog? As I understand it, the model uses a GPU and only one process can use the GPU at a time, so how do we scale to hundreds of requests per second?

Does Cog offer any support for batched inference?

akshhack avatar Jan 23 '23 05:01 akshhack

There's no built-in support for batching yet, though there's an open issue for it: #612. Batching would help with throughput in some cases, though beyond a certain point you'll need more instances to be able to scale further.

For Replicate we scale by running multiple instances of the Cog model to handle concurrent requests, using the built-in queue worker (though we're in the middle of re-architecting that worker in #870).
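In the meantime, if your model can process several inputs in one pass, one workaround people use is to accept a list of inputs in a single prediction. A rough sketch, assuming the standard cog.BasePredictor API — `load_model` and the batched model call are placeholders for your own code, not Cog APIs:

```python
# Rough sketch of a manual batching workaround in predict.py, assuming the
# standard cog.BasePredictor API. `load_model` and the batched model call
# are placeholders for your own code, not Cog APIs.
from typing import List

from cog import BasePredictor, Input


class Predictor(BasePredictor):
    def setup(self) -> None:
        # Load the model once per instance so every prediction reuses it.
        self.model = load_model()  # placeholder

    def predict(
        self,
        prompts: str = Input(description="Newline-separated prompts to run as one batch"),
    ) -> List[str]:
        # Split the single request into a batch and run it through the model
        # in one call, returning one output per prompt.
        batch = [p for p in prompts.split("\n") if p.strip()]
        return self.model(batch)  # placeholder: a call that handles a list
```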

evilstreak avatar Jan 23 '23 13:01 evilstreak

Can you provide details or point to specific documentation detailing the built-in queue worker?

The README.md says "automatic worker queues", while the HTTP doc says that "while this allows clients to create predictions 'asynchronously,' Cog can only run one prediction at a time, and it is currently the caller's responsibility to make sure that earlier predictions are complete before new ones are created."

I'm confused
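If I'm reading the HTTP doc correctly, the caller-side pattern is roughly the following (a sketch against the default synchronous POST /predictions endpoint; the input fields are made up for illustration):

```python
# Sketch: send predictions one at a time, waiting for each response before
# creating the next, since the default POST /predictions call blocks until
# the prediction has finished. Assumes a Cog container on localhost:5000.
import requests

inputs = [{"prompt": "first request"}, {"prompt": "second request"}]

for payload in inputs:
    resp = requests.post("http://localhost:5000/predictions", json={"input": payload})
    resp.raise_for_status()
    result = resp.json()
    print(result.get("status"), result.get("output"))
```

Is that the intended usage, and if so, what does "automatic worker queues" in the README refer to?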

hervenivon avatar Apr 04 '24 23:04 hervenivon

cc @technillogue @joehoover @daanelson

zeke avatar Apr 05 '24 20:04 zeke

Hi, can we get a comment on this? I am also confused by contradictory information found in the repo:

From the README:

🥞 Automatic queue worker. Long-running deep learning models or batch processing is best architected with a queue. Cog models do this out of the box. Redis is currently supported, with more in the pipeline.

From docs/redis.md:

Note: The redis queue API is no longer supported and has been removed from Cog.

From my own testing running Cog standalone, given two concurrent requests, the second one will be queued and execute after the first.

However, this again contradicts the advice in the README.

So please confirm: does Cog ship with a built-in in-memory queue out of the box? If so, what are the limits of this queue? If not, please explain how my two concurrent requests apparently execute in sequence.
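For reference, the test I ran was roughly along these lines (a sketch; the actual model and inputs are placeholders):

```python
# Fire two requests at a standalone Cog container at the same time and
# record when each one returns. Assumes the container listens on port 5000.
import threading
import time

import requests

URL = "http://localhost:5000/predictions"


def run(label: str) -> None:
    start = time.time()
    requests.post(URL, json={"input": {"prompt": label}}, timeout=600)
    print(f"{label} finished after {time.time() - start:.1f}s")


threads = [threading.Thread(target=run, args=(f"request-{i}",)) for i in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The second request consistently returns roughly one full prediction later than the first, which is what makes me think there is some kind of in-process queue.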

Thank you

DarrenGrant-Storycraft avatar Jun 07 '24 17:06 DarrenGrant-Storycraft

Hi, any comment here @zeke @technillogue @daanelson ?

DarrenGrant-Storycraft avatar Jun 20 '24 15:06 DarrenGrant-Storycraft

Also interested. I'd like to clearly understand what's happening under the hood when 2 requests are sent to Cog.

The current documentation is a bit unclear. I'd also like to know whether there is an endpoint that reports how many predictions are currently "queued", and what the limits of that queue are.

Thanks a lot, appreciate it

schankam avatar Jun 28 '24 13:06 schankam

Hi, I've been stress testing the Cog server on my machine. I see that all requests are blocking, and even when I use the PUT method with async headers, predictions are still processed sequentially in the background. I'm trying to figure out how to do batching: the front page of the README says batching is supported via the queue worker, but I don't see any documentation for it.
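For reference, the asynchronous path I tried looks roughly like this (a sketch based on my reading of docs/http.md; treat the exact path and header as assumptions and check the doc):

```python
# Create predictions with client-generated IDs via PUT and ask for an async
# response, then let them run in the background. The /predictions/<id> path
# and the Prefer header are taken from my reading of docs/http.md and may
# need checking against your Cog version.
import uuid

import requests

BASE = "http://localhost:5000"

for prompt in ["first", "second"]:
    pid = uuid.uuid4().hex
    resp = requests.put(
        f"{BASE}/predictions/{pid}",
        json={"input": {"prompt": prompt}},
        headers={"Prefer": "respond-async"},
    )
    print(pid, resp.status_code)
```

Even when created this way, the predictions still run one after another rather than being batched together.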

brian316 avatar Aug 30 '24 17:08 brian316

This is listed as a supported feature on the front page of the GitHub repo, but I don't see any information about it in the docs or in the code itself:

🥞 Automatic queue worker. Long-running deep learning models or batch processing is best architected with a queue. Cog models do this out of the box. Redis is currently supported, with more in the pipeline.

brian316 avatar Sep 11 '24 23:09 brian316

Just an FYI to the team: the lack of support as demonstrated in this ticket (including a parallel open support ticket) has made us reconsider using Cog in our operation.

DarrenGrant-Storycraft avatar Oct 01 '24 14:10 DarrenGrant-Storycraft