
how-to-run-inference-cloud-run-gpu-vllm

john-hawkins opened this issue 2 months ago • 4 comments

This code lab does not work.

I get all the way to the end of the tutorial. But the final step of deploying the container fails every time I try it.

This is the error:

Creating Revision...failed
Deployment failed
ERROR: (gcloud.beta.run.deploy) The user-provided container failed to start and listen on the port defined provided by the PORT=8000 environment variable within the allocated timeout. This can happen when the container port is misconfigured or if the timeout is too short. The health check timeout can be extended. Logs for this revision might contain more information.
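
For context, the failing step is the final gcloud deploy command, which looks roughly like this (the service name, image path, and resource values here are my assumptions, not necessarily the codelab's exact flags):

# Sketch of the deploy step that fails; names and sizes are assumptions.
gcloud beta run deploy vllm-gemma \
  --image us-central1-docker.pkg.dev/PROJECT_ID/vllm/vllm-gemma \
  --port 8000 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --cpu 8 \
  --memory 32Gi \
  --no-cpu-throttling \
  --region us-central1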

I have tried extending the health check timeouts, but then I get error messages telling me that 240 is the maximum value. Here are some examples of those error messages.

Deployment failed
ERROR: (gcloud.beta.run.deploy) service.spec.template.spec.containers[0].startup_probe: period_seconds must be a number between 0 and 240.

Deployment failed
ERROR: (gcloud.beta.run.deploy) service.spec.template.spec.containers[0].startup_probe.timeout_seconds: must be less than period_seconds.
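
For anyone reproducing this: the startup_probe fields named in those errors live in the service YAML, so one way to push the probe to its documented limits is to export the service definition, edit it, and re-apply. A minimal sketch, assuming a service named vllm-gemma listening on port 8000:

# Export the current service definition to a file.
gcloud run services describe vllm-gemma --region us-central1 --format export > service.yaml

# In service.yaml, under spec.template.spec.containers[0], set:
#   startupProbe:
#     tcpSocket:
#       port: 8000
#     periodSeconds: 240    # 240 is the maximum the error message allows
#     timeoutSeconds: 120   # must be less than periodSeconds
#     failureThreshold: 1

# Re-apply the edited definition.
gcloud run services replace service.yaml --region us-central1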

john-hawkins avatar Nov 11 '25 22:11 john-hawkins

My best guess at this time is that the problem is caused by a CUDA incompatibility.

The vLLM containers are built for CUDA 12.4, but Google Cloud Run only supports CUDA 12.2.

john-hawkins avatar Nov 12 '25 00:11 john-hawkins

Thank you for the feedback and the heads-up that the codelab is broken! I just gave it a try. It looks like the latest vLLM images use more GPU memory now. If you scroll up in your logs, you'll see that the container failed to start because of a GPU out-of-memory (OOM) exception.

Here's the quick fix that worked for me:

Update your Dockerfile to use these new parameters (the three new flags are --gpu-memory-utilization, --max-num-seqs, and --max-model-len):

ENTRYPOINT python3 -m vllm.entrypoints.openai.api_server \
    --port ${PORT:-8000} \
    --model ${MODEL_NAME:-google/gemma-2-2b-it} \
    --gpu-memory-utilization 0.85 \
    --max-num-seqs 256 \
    --max-model-len 4096
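
Briefly, --gpu-memory-utilization 0.85 caps the fraction of GPU memory vLLM pre-allocates, while --max-num-seqs and --max-model-len bound the batch size and context length so the KV cache fits. Once the revision starts, a quick smoke test against the OpenAI-compatible endpoint looks something like this (SERVICE_URL standing in for your deployed Cloud Run service URL):

# Hypothetical smoke test; substitute your deployed service URL.
curl -s "${SERVICE_URL}/v1/chat/completions" \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "google/gemma-2-2b-it",
        "messages": [{"role": "user", "content": "Say hello"}],
        "max_tokens": 32
      }'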

I'll also pin vLLM to the current version and will update the codelab shortly.

saraford avatar Nov 12 '25 19:11 saraford

I've updated the codelab with those new parameters and pinned vLLM to v0.11.0, which currently works. I've also fixed a couple of hard-coded us-central1 region typos.

https://codelabs.developers.google.com/codelabs/how-to-run-inference-cloud-run-gpu-vllm#0

Thanks again for the feedback!

saraford avatar Nov 13 '25 17:11 saraford

Thanks for the rapid response and solution

Much appreciated @saraford

john-hawkins avatar Nov 19 '25 22:11 john-hawkins