Does Cog support batched inference?
How do we scale requests with Cog? As I understand it, since a model uses a GPU and only one process can use that GPU at a time, how do we scale to hundreds of requests per second?
Is there any way Cog supports batched inference?
There's no built-in support for batching yet, though there's an open issue for it: #612. Batching would help with throughput in some cases, though beyond a certain point you'll need more instances to scale further.
For Replicate, we scale by running multiple instances of the Cog model to handle concurrent requests, using the built-in queue worker (though we're in the middle of re-architecting that worker in #870).
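In the meantime, one possible workaround is to do the batching inside a single prediction, so one caller groups its own inputs and the model runs a single forward pass over the whole group. This is just a rough sketch, not an official Cog feature: the `prompts` input, the newline splitting, and the stand-in model are placeholders for illustration.

```python
from typing import List

from cog import BasePredictor, Input


class Predictor(BasePredictor):
    def setup(self):
        # Load your real model here. The lambda is a stand-in so the sketch
        # runs without a GPU; it "processes" each prompt by uppercasing it.
        self.model = lambda batch: [p.upper() for p in batch]

    def predict(
        self,
        prompts: str = Input(description="Newline-separated prompts to run as a single batch"),
    ) -> List[str]:
        batch = [p.strip() for p in prompts.split("\n") if p.strip()]
        # In a real predictor this would be one batched forward pass on the GPU,
        # which is where the throughput gain over per-request inference comes from.
        return self.model(batch)
```

This only helps when a single caller can group its own inputs; it doesn't batch across independent requests, which is what built-in batching support would need to do.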
Can you provide details on the built-in queue worker, or point to specific documentation for it?
The README.md says "automatic worker queues", while the HTTP doc says that "while this allows clients to create predictions 'asynchronously,' Cog can only run one prediction at a time, and it is currently the caller's responsibility to make sure that earlier predictions are complete before new ones are created."
I'm confused.
cc @technillogue @joehoover @daanelson
Hi, can we get a comment on this? I'm also confused by the contradictory information in the repo:
From the README:
🥞 Automatic queue worker. Long-running deep learning models or batch processing is best architected with a queue. Cog models do this out of the box. Redis is currently supported, with more in the pipeline.
From docs/redis.md:
Note: The redis queue API is no longer supported and has been removed from Cog.
From my own testing running Cog standalone: given two concurrent requests, the second one is queued and executes after the first.
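Roughly, this is the test I ran. A minimal sketch: it assumes a Cog model already serving on http://localhost:5000 whose predict() takes a `prompt` input, so adjust the input to your model's schema.

```python
import concurrent.futures
import time

import requests

URL = "http://localhost:5000/predictions"


def run_prediction(i: int) -> float:
    """Send one synchronous prediction and return how long the call took."""
    start = time.time()
    requests.post(URL, json={"input": {"prompt": f"request {i}"}})
    return time.time() - start


with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    durations = list(pool.map(run_prediction, range(2)))

# If the server handles requests sequentially, the second duration is roughly
# twice the first: that request spends the first prediction's runtime waiting.
print(durations)
```

In my runs the second duration is roughly twice the first, which is what makes me say the second prediction is queued behind the first.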
However, this again contradicts the advice in the README.
So please confirm: does Cog ship with a built-in in-memory queue out of the box? If so, what are the limits of that queue? If not, please explain how my two concurrent requests apparently execute in sequence.
Thank you
Hi, any comment here, @zeke @technillogue @daanelson?
Also interested. I'd like to clearly understand what happens under the hood when two requests are sent to Cog.
The current documentation is a bit unclear, and I'd also like to know whether there is an endpoint that reports how many predictions are currently "queued", and what the limit of this queue is.
Thanks a lot, appreciate it
Hi, I've been testing the Cog server and stress-testing it on my machine. I see that all requests are blocking, and even when I use the PUT method with async headers, predictions are still processed sequentially in the background. I'm trying to figure out how to do batching, and although the front page of the README says batching is supported via the queue worker, I don't see any documentation for it.
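For reference, this is roughly what I mean by "the PUT method with async headers". A minimal sketch: the port, the `prompt` input, and the client-chosen prediction IDs are assumptions about my local setup, and I'm taking `Prefer: respond-async` to be the async header in question.

```python
import time
import uuid

import requests

BASE = "http://localhost:5000"

for i in range(2):
    prediction_id = uuid.uuid4().hex
    start = time.time()
    resp = requests.put(
        f"{BASE}/predictions/{prediction_id}",
        json={"input": {"prompt": f"request {i}"}},
        # Ask the server to respond before the prediction has finished running.
        headers={"Prefer": "respond-async"},
    )
    print(i, resp.status_code, f"returned in {time.time() - start:.2f}s")

# Both calls come back almost immediately, but watching the container logs,
# the predictions themselves still run one after the other in the background.
```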
This is listed as a supported feature on the front page of the GitHub repo, but I don't see any info about it in the docs or by looking at the code itself:
🥞 Automatic queue worker. Long-running deep learning models or batch processing is best architected with a queue. Cog models do this out of the box. Redis is currently supported, with more in the pipeline.
Just an FYI to the team: the lack of support demonstrated in this ticket (and in a parallel open support ticket) has made us reconsider using Cog in our operation.