[Feature]: Data parallel inference in offline mode (based on Ray)
🚀 The feature, motivation and pitch
I've been building model evaluation datasets using offline inference as outlined in the documentation, and I noticed that it's challenging to fully leverage all available GPUs when the model fits on a single GPU: the remaining GPUs simply sit idle.
To overcome this, I implemented a feature that distributes model replicas across different GPUs, allowing prompt data to be processed concurrently. For large datasets, this approach achieves nearly linear speedup, significantly enhancing performance for both my team and me.
It’s important to note that offline inference also plays a crucial role in model training and evaluation. By enabling efficient and scalable processing of evaluation data, offline inference helps in thoroughly benchmarking models and fine-tuning them during the development cycle.
Interestingly, this feature has been discussed before (see issue #1237), yet there hasn't been any implementation so far. I’m curious if others still find this feature useful for offline inference, as it would eliminate the need to launch multiple vLLM API services or develop a multi-threaded HTTP request program to fully utilize GPU resources. I’d be happy to contribute this enhancement.
Note: Currently, this feature is available only for offline inference, but I’m open to discussing adaptations for online mode if there’s enough interest.
In short: launch multiple Ray processes to support data parallelism and reduce the cost of large-scale offline inference.
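For illustration, a minimal sketch of the kind of Ray-based approach described above: one vLLM replica per GPU, scheduled as a Ray actor, with the prompt list sharded across replicas. The class and helper names here are illustrative, not the actual implementation:

```python
# Minimal sketch: data-parallel offline inference with one vLLM replica per GPU,
# each running as a Ray actor. Names are illustrative, not vLLM's real API.
import ray
from vllm import LLM, SamplingParams


@ray.remote(num_gpus=1)
class LLMReplica:
    def __init__(self, model: str):
        # Ray sets CUDA_VISIBLE_DEVICES for this actor, so the replica
        # only sees the single GPU it was assigned.
        self.llm = LLM(model=model)

    def generate(self, prompts: list[str]) -> list[str]:
        params = SamplingParams(temperature=0.0, max_tokens=128)
        outputs = self.llm.generate(prompts, params)
        return [o.outputs[0].text for o in outputs]


def run_data_parallel(prompts: list[str], model: str, num_replicas: int) -> list[str]:
    ray.init(ignore_reinit_error=True)
    replicas = [LLMReplica.remote(model) for _ in range(num_replicas)]
    # Shard the prompts round-robin across replicas.
    shards = [prompts[i::num_replicas] for i in range(num_replicas)]
    futures = [r.generate.remote(shard) for r, shard in zip(replicas, shards)]
    results = ray.get(futures)
    # Re-interleave the shard outputs to restore the original prompt order.
    merged = [None] * len(prompts)
    for rank, shard_outputs in enumerate(results):
        merged[rank::num_replicas] = shard_outputs
    return merged


if __name__ == "__main__":
    texts = run_data_parallel(["Hello, my name is"] * 32, "facebook/opt-125m", num_replicas=4)
```

For large datasets this gives roughly linear speedup, since each replica independently batches its own shard of the prompts.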
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
We also recently introduced Ray Data LLM features which should simplify this greatly?
https://docs.ray.io/en/latest/data/working-with-llms.html
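Based on the linked page, usage looks roughly like the sketch below; the exact names (`vLLMEngineProcessorConfig`, `build_llm_processor`, `model_source`, ...) vary between Ray versions, so treat them as assumptions and check the docs above:

```python
# Rough sketch of the Ray Data LLM batch-inference pattern from the linked docs.
# API and parameter names may differ between Ray versions; verify against the docs.
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model_source="facebook/opt-125m",
    concurrency=4,   # number of vLLM replicas (data parallelism)
    batch_size=64,
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(temperature=0.0, max_tokens=128),
    ),
    postprocess=lambda row: dict(answer=row["generated_text"]),
)

ds = ray.data.from_items([{"prompt": f"Question {i}"} for i in range(1000)])
ds = processor(ds)
ds.show(limit=3)
```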
yeah, I love this feature. But I am thinking about making vLLM support DP natively
I agree there is a need for a simpler, Ray-less offline inference pattern/example using regular Python multiprocessing
The example data_parallel.py is currently not very generic:
- https://github.com/vllm-project/vllm/issues/1237#issuecomment-2910167215
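Roughly, a Ray-less pattern along these lines could work: one worker process per GPU, pinned via CUDA_VISIBLE_DEVICES, each loading its own LLM instance and handling a shard of the prompts. This is just a sketch of the idea, not the shape of any existing vLLM example:

```python
# Sketch of a Ray-less data-parallel pattern: one process per GPU, each pinned
# to its device via CUDA_VISIBLE_DEVICES before the vLLM engine is created.
import os
import multiprocessing as mp


def worker(gpu_id, prompts, model, out_queue):
    # Pin this process to a single GPU before importing/creating the engine.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    from vllm import LLM, SamplingParams

    llm = LLM(model=model)
    outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=128))
    out_queue.put((gpu_id, [o.outputs[0].text for o in outputs]))


if __name__ == "__main__":
    mp.set_start_method("spawn")  # CUDA requires spawn, not fork
    model = "facebook/opt-125m"
    prompts = [f"Prompt {i}" for i in range(1000)]
    num_gpus = 4

    queue = mp.Queue()
    # Round-robin shard the prompts across GPUs.
    shards = [prompts[i::num_gpus] for i in range(num_gpus)]
    procs = [
        mp.Process(target=worker, args=(rank, shard, model, queue))
        for rank, shard in enumerate(shards)
    ]
    for p in procs:
        p.start()
    # Drain the queue before joining to avoid blocking on a full queue.
    results = dict(queue.get() for _ in procs)  # {gpu_rank: shard_outputs}
    for p in procs:
        p.join()
```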
Hi,
I am trying to understand the motivation.

> yeah, I love this feature. But I am thinking about making vLLM support DP natively

Coming from the Kubernetes world, it seems to me that having an orchestration framework such as k8s manage the scaling and resource allocation for individual vLLM API servers is the clean approach. Once you manage multiple processes, you start thinking about resource limits, autoscaling, etc., which is exactly what orchestration frameworks are meant to do.
Is the main use case here single-node deployments where an orchestration framework is overkill? For large-scale production deployments, how do you see the data-parallel feature comparing with simply scaling out more vLLM replicas without data parallelism?
I think both use cases are meaningful:
- when you're already using Ray for managing the pipeline, it can make sense to manage vLLM DP replicas with Ray as well (since Ray management/monitoring is already in place)
- when you just want to make use of all GPUs of your single node for a simple Python-script offline-inference job batch-processing a bunch of prompts/inputs, or for `vllm serve` inference (to use all of a single node's resources), it's useful to also have some basic support in vLLM that doesn't require torchrun, Ray, or manual management of GPU assignment. In the spirit of Python's philosophy: the simple use case should be easy, without the steep learning curve of learning and managing a Ray cluster for a simple single-node setup.
Yes, for large production setups, Kubernetes is the best choice:
- Deploy multiple vLLM pods
- Add a load balancer
- Set up a domain name

This allows data-parallel inference by just adding more pods when you have enough resources. But it needs:
1. A full Kubernetes setup
2. Kubernetes knowledge
3. Complicated preparation work
4. Extra effort: you also need to write multi-threaded client code to call the service, which makes it harder to implement.

Now think about a common situation: you just want to quickly test a fine-tuned LLM (not too big) on a single machine with ≤8 GPUs. If vLLM could handle data parallelism by itself, it would save a lot of time. This is what many users actually need for daily experiments.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!
Any updates? @re-imagined @vadimkantorov