
Feature request: custom autoscaling parameters

Open Varun2101 opened this issue 1 year ago • 2 comments

Hi, I'm working on deploying a private language model to production through Replicate. I have requests coming in sporadically so provisioning always-on servers is not feasible for me, but I would like requests to be handled at my max concurrency for increased speed. Currently I face 2-2.5 minutes of cold start for each instance and they terminate after 1 minute each, which can lead to some frustrating delays that are longer than necessary. Would it be possible to add either of these functionalities?

  1. API to force boot n instances together: reduces the spread of boot time, more control to start the boot process early before requests actually need to be processed
  2. Custom idle time limits: this needs to be at least as long as the boot time. I wouldn't mind having to pay for some extra uptime if it meant I don't have stop-start behaviour in the middle of a chunk being processed.

Currently I'm attempting a workaround for no.1 by burst-pinging the model early with the default input n times, but the short idle time means that there's still a good chance that the instances get terminated before I send any actual requests. Let me know if you have a better solution. Thanks!
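For reference, the burst-ping workaround can be sketched as follows. This is a minimal sketch, not an official API: `warm_up` and its parameters are hypothetical names, and the `ping` callable is whatever actually hits the model (e.g. with replicate-python, `lambda: replicate.run("owner/private-llm:version", input=DEFAULT_INPUT)`, where the model name and `DEFAULT_INPUT` are placeholders).

```python
from concurrent.futures import ThreadPoolExecutor

def warm_up(ping, n):
    """Fire n concurrent requests so the autoscaler boots n instances at once.

    ping: any zero-argument callable that sends one request to the model.
    Returns the n responses once all requests have completed.
    """
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(ping) for _ in range(n)]
        return [f.result() for f in futures]
```

As noted above, this only forces the boot; with a 1-minute idle limit the instances may still terminate before real traffic arrives.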

Varun2101 avatar Jan 09 '24 05:01 Varun2101

Hi, @Varun2101. Thanks for sharing this feedback.

To your second point, you can get more control over the behavior of a model on Replicate by creating a deployment. I don't believe we provide a way to configure the timing for autoscaling a deployment, but that's something we've discussed.

mattt avatar Jan 30 '24 18:01 mattt

I've built a workaround that simply pings the model (to minimize tokens, I ask it to respond with a single character). Each ping takes 0.1 seconds of runtime, and I set the ping frequency to 1 minute. After 15 minutes of inactivity the pings stop. While it doesn't solve Replicate's initial cold-boot issue, this keeps the model warm while a user is active at negligible cost - far cheaper than creating a dedicated deployment.

nathan-eagle avatar Oct 02 '24 13:10 nathan-eagle