[how-to-run-inference-cloud-run-gpu-vllm]:
This codelab does not work.
I get all the way to the end of the tutorial, but the final step of deploying the container fails every time I try it.
This is the error:
Creating Revision...failed
Deployment failed
ERROR: (gcloud.beta.run.deploy) The user-provided container failed to start and listen on the port defined provided by the PORT=8000 environment variable within the allocated timeout. This can happen when the container port is misconfigured or if the timeout is too short. The health check timeout can be extended. Logs for this revision might contain more information.
I have tried extending the health check timeouts, but then I get error messages telling me that 240 is the maximum value. Here are some examples of those error messages:
Deployment failed
ERROR: (gcloud.beta.run.deploy) service.spec.template.spec.containers[0].startup_probe: period_seconds must be a number between 0 and 240.
Deployment failed
ERROR: (gcloud.beta.run.deploy) service.spec.template.spec.containers[0].startup_probe.timeout_seconds: must be less than period_seconds.
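For context, those limits apply to the startup probe fields in the service spec (the same paths the errors point at). This is roughly the shape of what I was adjusting; the TCP probe and the exact values here are only illustrative:

  # Under spec.template.spec.containers[0] in the service YAML
  startupProbe:
    tcpSocket:
      port: 8000          # the port vLLM serves on
    periodSeconds: 240    # 240 is the ceiling, per the error above
    timeoutSeconds: 120   # has to stay below periodSeconds
    failureThreshold: 1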
My best guess at this point is that the problem is caused by a CUDA compatibility issue: the vLLM containers are built for CUDA 12.4, but Google Cloud Run only supports CUDA 12.2.
Thank you for the feedback and the heads-up that the codelab is broken! I just gave it a try. It looks like the latest vLLM images use more GPU memory now. If you scroll up in your logs, you'll see that the reason the container failed to start is that it hit an out-of-memory (OOM) error on the GPU.
Here's the quick fix that worked for me:
Update your Dockerfile to use these new parameters (the new flags are the GPU memory, sequence, and context-length limits):
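# Cap GPU memory use, concurrent sequences, and context length so the model fits on the Cloud Run GPU (this is what avoids the GPU OOM seen in the logs)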
ENTRYPOINT python3 -m vllm.entrypoints.openai.api_server \
    --port ${PORT:-8000} \
    --model ${MODEL_NAME:-google/gemma-2-2b-it} \
    --gpu-memory-utilization 0.85 \
    --max-num-seqs 256 \
    --max-model-len 4096
I'll also pin vLLM to the current version and will update the codelab shortly.
I've updated the codelab with those new parameters and pinned vLLM to v0.11.0, which currently works. I've also fixed a couple of hard-coded us-central1 region typos.
https://codelabs.developers.google.com/codelabs/how-to-run-inference-cloud-run-gpu-vllm#0
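If you'd rather apply the pin by hand in an existing build, the change is just the base image tag in the Dockerfile (this sketch assumes your Dockerfile builds from the vllm/vllm-openai image; adjust if yours differs):

  # Pin to a known-good vLLM release instead of :latest
  FROM vllm/vllm-openai:v0.11.0

Pinning also keeps a future image update from quietly changing the GPU memory behavior again.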
Thanks again for the feedback!
Thanks for the rapid response and solution
Much appreciated @saraford