vllm [Feature][P0]: Switch to Runtime Base Image

🚀 The feature, motivation and pitch

Description

The Dockerfile currently uses nvidia/cuda:12.9.1-devel-ubuntu22.04 as the final base image. The devel variant includes the full CUDA compiler toolchain (~7 GB), which is needed only at build time, not at runtime. Switching to the runtime variant should significantly reduce the image size.

What You'll Do

Change FINAL_BASE_IMAGE from devel to runtime (line 24)
Analyze whether any runtime components actually need build tools
Handle FlashInfer JIT compilation requirements:
- Test whether the AOT wheels work without build dependencies
- If needed, conditionally install a minimal set of build tools
Verify all GPU functionality works with runtime image
Update documentation
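
For the analysis step, one possible starting point is to check from inside a container which build tools are still on the PATH. The helper below is a minimal, hypothetical sketch: the function name and the candidate tool list (nvcc, gcc, g++) are assumptions for illustration, not a confirmed list of what FlashInfer's JIT path invokes.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Assumed candidates only; replace with whatever the JIT path actually calls.
CANDIDATE_BUILD_TOOLS = ("nvcc", "gcc", "g++")


def missing_tools(
    tools: Iterable[str] = CANDIDATE_BUILD_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the entries of `tools` that cannot be found on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [tool for tool in tools if which(tool) is None]
```

Running `missing_tools()` in a container built from the devel image and again in one built from the runtime image would show which tools the switch removes. Whether any of them is needed at runtime still has to be confirmed by exercising the GPU code paths.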

Deliverables

[ ] Modified Dockerfile with runtime base image
[ ] Conditional build dependency installation for FlashInfer (if needed)
[ ] GPU functionality test results
[ ] Before/after image size comparison
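
The before/after size comparison could be produced with a small script along these lines. This is a sketch under assumptions: it relies on images being present in the local Docker daemon, and both function names are hypothetical helpers rather than anything in the vLLM repository.

```python
import subprocess


def image_size_bytes(tag: str) -> int:
    """Ask the local Docker daemon for an image's size in bytes."""
    result = subprocess.run(
        ["docker", "image", "inspect", "--format", "{{.Size}}", tag],
        check=True,
        capture_output=True,
        text=True,
    )
    return int(result.stdout.strip())


def size_report(before_bytes: int, after_bytes: int) -> str:
    """Summarize the size difference between two images (decimal GB)."""
    saved = before_bytes - after_bytes
    percent = 100.0 * saved / before_bytes
    return (
        f"before={before_bytes / 1e9:.2f} GB "
        f"after={after_bytes / 1e9:.2f} GB "
        f"saved={saved / 1e9:.2f} GB ({percent:.1f}%)"
    )
```

For example, `size_report(image_size_bytes("vllm:devel-base"), image_size_bytes("vllm:runtime-base"))` would print the comparison, where both tags are placeholder names for locally built images.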

Nov 13 '25 12:11 rzabarazesh

Given that DeepGEMM also appears to JIT-compile its kernels, wouldn't it need AOT-compiled kernels as well?

Nov 13 '25 16:11 bbartels

Either way, this would have to be merged first: https://github.com/vllm-project/vllm/pull/26966. I'll fix up the merge conflicts as soon as v0.11.1 is out, as I was told to hold off until then.

Nov 13 '25 16:11 bbartels

@bbartels I'm going to split the AOT vs. JIT question out of this ticket. For now, simply switching the base image to runtime and explicitly installing the tools and headers we need should still save us a good amount of space. Something like https://github.com/vllm-project/vllm/pull/28727

Nov 14 '25 17:11 rzabarazesh

> @bbartels I'm going to split the AOT vs. JIT question out of this ticket. For now, simply switching the base image to runtime and explicitly installing the tools and headers we need should still save us a good amount of space. Something like https://github.com/vllm-project/vllm/pull/28727

Sounds good, I'll fix up the source compilation PR later today. That should save some space as well!

Nov 14 '25 17:11 bbartels

https://github.com/vllm-project/vllm/pull/28727 is ready for review and saves 3 GB of space.

Nov 14 '25 19:11 rzabarazesh