worker-vllm

Very slow cold starts even with flashboot

Open avacaondata opened this issue 1 year ago • 8 comments

Hi, we're experiencing very slow cold starts even with FlashBoot enabled, and this didn't happen before with the same model architecture (Llama-3.1-8B, though different custom versions). In fact, about a minute after the last request, the worker appears to initialize again, downloading weights, etc.

We tried attaching a data storage volume to the endpoint, thinking that would lower cold start times (our hypothesis was that the weights would be downloaded there once and then simply loaded from that disk on subsequent requests). However, that made things even worse, with delay times going up to about 2 minutes per request.

We're about to launch an AI-based app and have been using Runpod for development for some months now, but these delay times are not acceptable for the app to run properly. We need to scale down to 0 to keep costs variable at the beginning until we have more customers (then we will use active workers, for sure). Can you please provide a solution for this? Is there some way to configure Runpod serverless endpoints so that delay times come back to where they were a week or two ago (1-2 s)?

@alpayariyak @Jorghi12 @pandyamarut @justinmerrell @carlson-svg @mikljohansson @casper-hansen @joennlae @willsamu @rachfop @vladmihaisima Thanks in advance.

avacaondata avatar Sep 18 '24 21:09 avacaondata

@avacaondata Sorry for the inconvenience, I understand the frustration. We changed the flow a bit to make it easier to update the vLLM version, but there's a model caching feature we are rolling out soon, and that should solve this.

The model is indeed downloaded once and then loaded from disk, but let me check this as well. Alternatively, you can try baking the model into the Docker image.
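Roughly, the bake-in approach means pre-downloading the weights at image build time so they already sit on the container's local disk. A minimal sketch using `huggingface_hub` (the repo id, target path, and runtime wiring are placeholders, not the exact worker-vllm setup):

```python
# prefetch_model.py - intended to run from a Dockerfile RUN step at build time,
# so the weights ship inside the image instead of being fetched at cold start.
# The repo_id and local_dir are placeholders; point the worker's model path at local_dir.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",  # your custom model repo here
    local_dir="/models/llama-3.1-8b",            # baked into an image layer
)
```

At runtime you would then point the worker at `/models/llama-3.1-8b` instead of a Hugging Face repo id, so nothing needs to be downloaded when the worker starts.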

pandyamarut avatar Sep 18 '24 22:09 pandyamarut

Thanks for the quick reply and for your understanding @pandyamarut :) Could you please guide us a bit on best practices for deploying models effectively until that model caching feature is released? We would really appreciate it. If we pre-build the Docker container ourselves instead of using the web-interface vLLM template you provide, would that help? What else can we do to improve cold-start times? Thanks again for looking into this; I hope it can be fixed soon so that we can keep using Runpod for our production deployment. Apart from this issue, our experience with your cloud has been very good so far.

avacaondata avatar Sep 19 '24 00:09 avacaondata

If you build the model as part of the Docker image, it’ll definitely help with the cold start. Loading the model from the local disk on the host server to GPU vRAM is faster than pulling it from a network volume. You can also extend the idle timeout a bit so the worker doesn’t go to sleep right after finishing a request, which helps avoid cold starts. And of course, keeping an active worker is the best way to prevent cold starts.
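Once you've rebuilt the image (and/or raised the idle timeout), it's worth timing one request from cold to confirm the delay actually dropped. A rough sketch against the serverless `runsync` endpoint; the endpoint id, API key, and input payload are placeholders and depend on your worker's request schema:

```python
# time_cold_start.py - fire one request at a scaled-to-zero endpoint and time it.
# RUNPOD_ENDPOINT_ID / RUNPOD_API_KEY are assumed env vars; the "input" payload
# below is only a guess at a minimal vLLM-worker request and may need adjusting.
import os
import time

import requests

endpoint_id = os.environ["RUNPOD_ENDPOINT_ID"]
api_key = os.environ["RUNPOD_API_KEY"]

url = f"https://api.runpod.ai/v2/{endpoint_id}/runsync"
headers = {"Authorization": f"Bearer {api_key}"}
payload = {"input": {"prompt": "ping", "sampling_params": {"max_tokens": 1}}}

start = time.time()
resp = requests.post(url, json=payload, headers=headers, timeout=600)
resp.raise_for_status()

print(f"round trip: {time.time() - start:.1f}s")
print(resp.json())  # the response body typically includes delay/execution timing fields
```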

Yhlong00 avatar Sep 19 '24 01:09 Yhlong00

Hello, is this being addressed?

ComposerKevin avatar Mar 07 '25 10:03 ComposerKevin

Also interested in this. Are custom Docker images currently the best way to speed up cold starts? I saw a PR that enabled model caching (#157), but it got reverted soon after (#161).

juancampa avatar Mar 24 '25 20:03 juancampa

The real issue is that loading is slow (about 2-3 minutes for an 8B model), and that loading time costs money! It costs me about 50 cents every time the model has to load... that's huge, and I'm pretty disappointed.

MiMillieuh avatar May 20 '25 18:05 MiMillieuh

Any updates on this? I tried network volumes, but that will also limit my choice of GPUs, right?

HoMi264 avatar Jul 29 '25 16:07 HoMi264

Any update on this? @pandyamarut, why did your PR get reverted?

AlexisMDP avatar Aug 03 '25 11:08 AlexisMDP