[Feature]: Data parallel inference in offline mode (based on Ray)
🚀 The feature, motivation and pitch
I've been building model evaluation datasets using offline inference as outlined in the documentation, and I noticed that it's challenging to fully leverage all available GPUs when the model fits on a single GPU: the remaining GPUs simply sit idle.
To overcome this, I implemented a feature that distributes model replicas across different GPUs, allowing prompt data to be processed concurrently. For large datasets, this approach achieves nearly linear speedup, significantly enhancing performance for both my team and me.
It’s important to note that offline inference also plays a crucial role in model training and evaluation. By enabling efficient and scalable processing of evaluation data, offline inference helps in thoroughly benchmarking models and fine-tuning them during the development cycle.
Interestingly, this feature has been discussed before (see issue #1237), yet there hasn't been any implementation so far. I’m curious if others still find this feature useful for offline inference, as it would eliminate the need to launch multiple vLLM API services or develop a multi-threaded HTTP request program to fully utilize GPU resources. I’d be happy to contribute this enhancement.
Note: Currently, this feature is available only for offline inference, but I’m open to discussing adaptations for online mode if there’s enough interest.
In short: launch multiple Ray processes to support data parallelism and reduce the cost of large-scale offline inference.
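For illustration, a minimal sketch of the kind of Ray-based approach described above: one vLLM replica per GPU, scheduled as a Ray actor, with the prompt list sharded across replicas. The class and helper names here are illustrative, not the actual implementation:

```python
# Minimal sketch: data-parallel offline inference with one vLLM replica per GPU,
# each running as a Ray actor. Names are illustrative, not vLLM's real API.
import ray
from vllm import LLM, SamplingParams


@ray.remote(num_gpus=1)
class LLMReplica:
    def __init__(self, model: str):
        # Ray sets CUDA_VISIBLE_DEVICES for this actor, so the replica
        # only sees the single GPU it was assigned.
        self.llm = LLM(model=model)

    def generate(self, prompts: list[str]) -> list[str]:
        params = SamplingParams(temperature=0.0, max_tokens=128)
        outputs = self.llm.generate(prompts, params)
        return [o.outputs[0].text for o in outputs]


def run_data_parallel(prompts: list[str], model: str, num_replicas: int) -> list[str]:
    ray.init(ignore_reinit_error=True)
    replicas = [LLMReplica.remote(model) for _ in range(num_replicas)]
    # Shard the prompts round-robin across replicas.
    shards = [prompts[i::num_replicas] for i in range(num_replicas)]
    futures = [r.generate.remote(shard) for r, shard in zip(replicas, shards)]
    results = ray.get(futures)
    # Re-interleave the shard outputs to restore the original prompt order.
    merged = [None] * len(prompts)
    for rank, shard_outputs in enumerate(results):
        merged[rank::num_replicas] = shard_outputs
    return merged


if __name__ == "__main__":
    texts = run_data_parallel(["Hello, my name is"] * 32, "facebook/opt-125m", num_replicas=4)
```

For large datasets this gives roughly linear speedup, since each replica independently batches its own shard of the prompts.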
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
We also recently introduced Ray Data LLM features which should simplify this greatly?
https://docs.ray.io/en/latest/data/working-with-llms.html
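Based on the linked page, usage looks roughly like the sketch below; the exact names (`vLLMEngineProcessorConfig`, `build_llm_processor`, `model_source`, ...) vary between Ray versions, so treat them as assumptions and check the docs above:

```python
# Rough sketch of the Ray Data LLM batch-inference pattern from the linked docs.
# API and parameter names may differ between Ray versions; verify against the docs.
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model_source="facebook/opt-125m",
    concurrency=4,   # number of vLLM replicas (data parallelism)
    batch_size=64,
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(temperature=0.0, max_tokens=128),
    ),
    postprocess=lambda row: dict(answer=row["generated_text"]),
)

ds = ray.data.from_items([{"prompt": f"Question {i}"} for i in range(1000)])
ds = processor(ds)
ds.show(limit=3)
```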
yeah, I love this feature. But I am thinking about making vLLM support DP natively
I agree there is a need for a simpler, Ray-less offline inference pattern/example using regular Python multiprocessing
The example data_parallel.py is currently not very generic:
- https://github.com/vllm-project/vllm/issues/1237#issuecomment-2910167215
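Roughly, a Ray-less pattern along these lines could work: one worker process per GPU, pinned via CUDA_VISIBLE_DEVICES, each loading its own LLM instance and handling a shard of the prompts. This is just a sketch of the idea, not the shape of any existing vLLM example:

```python
# Sketch of a Ray-less data-parallel pattern: one process per GPU, each pinned
# to its device via CUDA_VISIBLE_DEVICES before the vLLM engine is created.
import os
import multiprocessing as mp


def worker(gpu_id, prompts, model, out_queue):
    # Pin this process to a single GPU before importing/creating the engine.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    from vllm import LLM, SamplingParams

    llm = LLM(model=model)
    outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=128))
    out_queue.put((gpu_id, [o.outputs[0].text for o in outputs]))


if __name__ == "__main__":
    mp.set_start_method("spawn")  # CUDA requires spawn, not fork
    model = "facebook/opt-125m"
    prompts = [f"Prompt {i}" for i in range(1000)]
    num_gpus = 4

    queue = mp.Queue()
    # Round-robin shard the prompts across GPUs.
    shards = [prompts[i::num_gpus] for i in range(num_gpus)]
    procs = [
        mp.Process(target=worker, args=(rank, shard, model, queue))
        for rank, shard in enumerate(shards)
    ]
    for p in procs:
        p.start()
    # Drain the queue before joining to avoid blocking on a full queue.
    results = dict(queue.get() for _ in procs)  # {gpu_rank: shard_outputs}
    for p in procs:
        p.join()
```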
Hi,
I am trying to understand the motivation.

> yeah, I love this feature. But I am thinking about making vLLM support DP natively

Coming from the Kubernetes world, it seems to me that having an orchestration framework such as k8s manage the scaling and resource allocation for individual vLLM API servers is the clean approach. Once you manage multiple processes, you start thinking about resource limits, autoscaling, etc., which is exactly what orchestration frameworks are meant to do.
Is the main use case here single-node deployments where an orchestration framework is overkill? For large-scale production deployments, how do you see the data-parallel feature comparing with simply scaling out more vLLM replicas without data parallelism?
I think both use cases are meaningful:
- when you're already using Ray for managing the pipeline, it can make sense to manage vLLM DP replicas with Ray as well (since Ray management/monitoring is already in place)
- when you just want to make use of all GPUs of your single node for a simple Python-script offline-inference job batch-processing a bunch of prompts/inputs, or for `vllm serve` inference (to use all of a single node's resources), it's useful to also have some basic support in vLLM that doesn't require torchrun, Ray, or manual management of GPU assignment. In the spirit of Python's philosophy: the simple use case should be easy, without the steep learning curve of learning and managing a Ray cluster for a simple single-node setup.
Yes, for large production setups, Kubernetes is the best choice:
- Deploy multiple vLLM pods
- Add a load balancer
- Set up a domain name

This allows data-parallel inference by just adding more pods when you have enough resources. But it needs:
1. A full Kubernetes setup
2. Kubernetes knowledge
3. Complicated preparation work
4. Extra effort: you also need to write multi-threaded client code to call the service, which makes it harder to implement.

Now think about a common situation: you just want to quickly test a fine-tuned LLM (not too big) on a single machine with ≤8 GPUs. If vLLM could handle data parallelism by itself, it would save a lot of time. This is what many users actually need for daily experiments.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!
Any updates? @re-imagined @vadimkantorov