[Feature] ray serve + model parallel deepspeed inference
Search before asking
- [X] I had searched in the issues and found no similar feature requirement.
Description
Hi folks,
I'm trying to split a model across multiple GPUs within Ray Serve using DeepSpeed inference.
I believe the problem boils down to the new Ray processes not being started with the DeepSpeed launcher.
Could what the launcher does somehow be forwarded to the child processes spawned through Ray?
Ideally, for maximum scalability, this should work both on a single node with multiple GPUs and across multiple nodes for very large models. Some ideas for how this UX could be implemented:
- serve.start(deepspeed_launcher=True)
- a new deepspeedray launcher that starts the Serve script with DeepSpeed's launcher and propagates its setup to the processes Ray spawns.
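For reference, here is a rough, untested sketch of what "forwarding what the launcher does" might look like today with plain Ray actors: each actor exports the environment variables the deepspeed launcher would normally set (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) before initializing DeepSpeed. The ShardWorker name and the port number are made up for illustration.

```python
import os

import deepspeed
import ray
import torch
from torch import nn

N_GPUS = 2


@ray.remote(num_gpus=1)
class ShardWorker:
    """One actor per GPU; together they form the model-parallel group."""

    def __init__(self, rank: int, world_size: int, master_addr: str, master_port: int):
        # Environment variables the 'deepspeed' launcher would normally export.
        os.environ["RANK"] = str(rank)
        os.environ["LOCAL_RANK"] = "0"  # Ray already pins one GPU per actor via CUDA_VISIBLE_DEVICES.
        os.environ["WORLD_SIZE"] = str(world_size)
        os.environ["MASTER_ADDR"] = master_addr
        os.environ["MASTER_PORT"] = str(master_port)

        deepspeed.init_distributed(dist_backend="nccl")
        self.engine = deepspeed.init_inference(
            model=nn.Linear(64, 64),
            mp_size=world_size,
            dtype=torch.float16,
            replace_method=False,
            replace_with_kernel_inject=True,
        )

    def forward(self, batch: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            return self.engine(batch.half().cuda()).cpu()


if __name__ == "__main__":
    ray.init()
    master_addr = ray.util.get_node_ip_address()
    workers = [
        ShardWorker.remote(rank, N_GPUS, master_addr, 29500)
        for rank in range(N_GPUS)
    ]
    batch = torch.randn(4, 64)
    # Every rank has to participate in the forward pass.
    outputs = ray.get([w.forward.remote(batch) for w in workers])
```

If something like this works, a Serve deployment could wrap these actors instead of creating the DeepSpeed engine inside its own replica process.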
Some code to demonstrate the use case:

```python
import time
from typing import List, Dict, Any

import deepspeed
import torch
from ray import serve
from torch import nn

N_GPUS = 2

###################################################
# -- The snippet below works if launched with 'deepspeed serve_multigpu_deepspeed_min_repro.py'
#    but NOT with 'python3 serve_multigpu_deepspeed_min_repro.py'.
# nn_torch = nn.Linear(5, 5)
# nn_ds = deepspeed.init_inference(
#     model=nn_torch,
#     mp_size=N_GPUS,
#     dtype=torch.float16,
#     replace_method=False,
#     replace_with_kernel_inject=True,
# )
# exit()
###################################################

# -- This doesn't work with either launcher. Maybe because what the deepspeed launcher
#    does is not propagated on to the new processes?
@serve.deployment(
    name="DeepspeedNetMPNet",
    _autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 1,
    },
    ray_actor_options={"num_gpus": N_GPUS, "num_cpus": 1},
    version="0.0.1",
)
class Net:
    def __init__(self):
        self.nn_torch = nn.Linear(64, 64)
        self.nn_ds = deepspeed.init_inference(
            model=self.nn_torch,
            mp_size=N_GPUS,
            dtype=torch.float16,
            replace_method=False,
            replace_with_kernel_inject=True,
        )

    # @serve.batch(max_batch_size=4)
    async def __call__(
        self,
        requests_batch: List[torch.Tensor],
    ) -> List[Dict[str, Any]]:
        with torch.no_grad():
            return self.nn_ds(torch.stack(requests_batch))


serve.start()
Net.deploy()

# Serve will be shut down once the script exits, so keep it alive manually.
while True:
    time.sleep(5)
    print("Deployments:")
    for k, v in serve.list_deployments().items():
        print(f"{k} | {v}\n")
```
Use case
Multi-GPU model parallelization helps speed up large-neural-net APIs.
Examples:
- Single node, multiple GPUs for large neural nets
- Very large neural nets across multiple nodes with multiple GPUs each
- Speeding up inference for latency-critical API tasks with small/medium-scale NNs
Related issues
No response
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
Hi @EricSteinberger, thanks for providing context for the issue. You're testing a workload that we think is critical but haven't invested much in yet, so some edges may be rough. I haven't used DeepSpeed before, but after looking into its launcher script, I think you would probably have an easier time prototyping directly with the Ray Actor APIs, spawning and coordinating the child processes as actors. For advanced communication patterns via NCCL you might find ray.collective helpful: https://docs.ray.io/en/latest/ray-more-libs/ray-collective.html. From there, you can move on to scaling this distributed inference workload with Ray Serve.
We need to understand your workload better to be more effective -- I've sent you a message on Ray Slack.
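A minimal sketch of that actor + ray.util.collective pattern, in case it helps as a starting point (the actor, method, and group names here are illustrative, and this is not a drop-in DeepSpeed integration):

```python
import ray
import ray.util.collective as col
import torch


@ray.remote(num_gpus=1)
class ParallelWorker:
    """One actor per GPU; together they could later host the shards of one model."""

    def __init__(self):
        self.buffer = torch.ones(4, device="cuda")

    def setup(self, world_size: int, rank: int):
        # Join an NCCL collective group so the actors can exchange tensors directly.
        col.init_collective_group(world_size, rank, backend="nccl", group_name="tp_group")

    def allreduce(self) -> torch.Tensor:
        # Collective call: every rank in the group must invoke it.
        col.allreduce(self.buffer, group_name="tp_group")
        return self.buffer.cpu()


if __name__ == "__main__":
    ray.init()
    world_size = 2
    workers = [ParallelWorker.remote() for _ in range(world_size)]
    ray.get([w.setup.remote(world_size, rank) for rank, w in enumerate(workers)])
    print(ray.get([w.allreduce.remote() for w in workers]))  # each buffer becomes all twos
```

Once the actors can communicate, a Serve deployment could hold handles to them and fan each request out to all ranks.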
Hi! I replied on Slack with more info about our use case. Thank you for looking into the issue!
Hey @jiaodong,
I am also looking to use DeepSpeed + Ray to run inference with model parallelism.
Has the recommendation changed from the above? I would very much appreciate it if you could point me to the latest documentation for these concepts.
Thanks!