
[Feature] ray serve + model parallel deepspeed inference

Open EricSteinberger opened this issue 2 years ago • 3 comments

Search before asking

  • [X] I had searched in the issues and found no similar feature requirement.

Description

Hi folks,

I'm trying to split a model across multiple gpus within ray serve using deepspeed inference.

I believe it boils down to the new ray processes not being started with the deepspeed launcher.

Could what the launcher does somehow be forwarded to child processes spawned through ray?
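
As far as I understand, the deepspeed launcher mainly sets the usual torch.distributed environment variables (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) for each process it spawns. A rough, hypothetical sketch of what forwarding that setup into a Ray-spawned process could look like on a single node (the helper name and all values below are mine, hard-coded for illustration):

import os

def set_launcher_env(rank: int, world_size: int,
                     master_addr: str = "127.0.0.1",
                     master_port: int = 29500) -> None:
    """Set the env vars the deepspeed launcher would normally provide (hypothetical helper)."""
    os.environ["RANK"] = str(rank)
    os.environ["LOCAL_RANK"] = str(rank)        # single-node assumption
    os.environ["WORLD_SIZE"] = str(world_size)
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = str(master_port)

# Each of the N_GPUS processes would call this with its own rank before
# deepspeed.init_inference(...) so that the distributed rendezvous can complete.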

Ideally, for maximum scalability, this should work both on one node with multiple GPUs and across multiple nodes for very large models. Some ideas for how this UX could be implemented:

  1. serve.start(deepspeed_launcher=True)
  2. A new deepspeedray launcher that runs the serve script under deepspeed's launcher and propagates its setup to the processes Ray spawns.

Some code to demonstrate the use case:

import time
from typing import List, Dict, Any

import deepspeed
import torch
from ray import serve
from torch import nn

N_GPUS = 2
###################################################
# -- The snippet below works if launched with 'deepspeed serve_multigpu_deepspeed_min_repro.py',
#    but NOT with 'python3 serve_multigpu_deepspeed_min_repro.py'.

# nn_torch = nn.Linear(5, 5)
# nn_ds = deepspeed.init_inference(
#     model=nn_torch,
#     mp_size=N_GPUS,
#     dtype=torch.float16,
#     replace_method=False,
#     replace_with_kernel_inject=True,
# )
# exit()


###################################################
# -- This doesn't work with either launcher. Maybe what the deepspeed launcher sets up
#    is not propagated to the new processes?
@serve.deployment(
    name="DeepspeedNetMPNet",
    _autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 1,
    },
    ray_actor_options={"num_gpus": N_GPUS, "num_cpus": 1},
    version="0.0.1",
)
class Net:
    def __init__(self):
        self.nn_torch = nn.Linear(64, 64)
        self.nn_ds = deepspeed.init_inference(
            model=self.nn_torch,
            mp_size=N_GPUS,
            dtype=torch.float16,
            replace_method=False,
            replace_with_kernel_inject=True,
        )

    # @serve.batch(max_batch_size=4)
    async def __call__(self,
                       requests_batch: List[torch.Tensor],
                       ) -> List[Dict[str, Any]]:
        with torch.no_grad():
            return self.nn_ds(torch.stack(requests_batch))


serve.start()
Net.deploy()

# Serve will be shut down once the script exits, so keep it alive manually.
while True:
    time.sleep(5)
    print("Deployments:")
    for k, v in serve.list_deployments().items():
        print(f"{k} | {v}\n")

Use case

Multi-GPU model parallelization helps speed up large-neural-net APIs.

Examples:

  1. Single node, multiple GPUs for large neural nets
  2. Very large neural nets across multiple nodes with multiple GPUs each
  3. Speeding up inference for latency-critical API tasks with small/medium-scale NNs

Related issues

No response

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

EricSteinberger avatar Mar 03 '22 21:03 EricSteinberger

Hi @EricSteinberger, thanks for providing context for the issue. You're testing a workload that we think is critical but haven't invested much in yet, so some edges may be rough. I haven't worked with DeepSpeed before, but after looking into its launcher script I think you would have an easier time prototyping directly with the Ray Actor APIs, spawning and coordinating the child processes via actors. For advanced communication patterns over NCCL you might find ray.collective helpful: https://docs.ray.io/en/latest/ray-more-libs/ray-collective.html. From there you can move on to scaling this distributed inference workload with Ray Serve.
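
To make that concrete, here is a rough, untested sketch (all names and values below are mine, and it assumes a single node) of coordinating DeepSpeed tensor-parallel inference with plain Ray actors, one actor per GPU, before layering Ray Serve on top:

import os

import deepspeed
import ray
import torch
from torch import nn

N_GPUS = 2


@ray.remote(num_gpus=1)
class DeepSpeedWorker:
    """One actor per model-parallel rank; together the actors hold one sharded model."""

    def __init__(self, rank: int, world_size: int,
                 master_addr: str = "127.0.0.1", master_port: int = 29500):
        # Reproduce the environment the deepspeed launcher would normally set up.
        os.environ["RANK"] = str(rank)
        os.environ["LOCAL_RANK"] = "0"  # Ray exposes exactly one GPU to this actor
        os.environ["WORLD_SIZE"] = str(world_size)
        os.environ["MASTER_ADDR"] = master_addr
        os.environ["MASTER_PORT"] = str(master_port)

        model = nn.Linear(64, 64)
        self.engine = deepspeed.init_inference(
            model=model,
            mp_size=world_size,
            dtype=torch.float16,
            replace_method=False,
            replace_with_kernel_inject=True,
        )

    def infer(self, batch: torch.Tensor):
        with torch.no_grad():
            return self.engine(batch.cuda().half())


ray.init()
workers = [DeepSpeedWorker.remote(rank, N_GPUS) for rank in range(N_GPUS)]
# Every rank runs the forward pass; rank 0's output is the one to read back.
batch = torch.rand(4, 64)
outputs = ray.get([w.infer.remote(batch) for w in workers])
print(outputs[0])

Once something like this works with plain actors, the same worker group could be created inside a Serve deployment's __init__ and fanned out to from __call__.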

We need to understand your workload better first to be more effective -- I've sent you a message on Ray Slack.

jiaodong avatar Apr 25 '22 18:04 jiaodong

Hi! I replied on Slack with more info about our use case. Thank you for looking into the issue!

EricSteinberger avatar Apr 26 '22 11:04 EricSteinberger

Hey @jiaodong,

I am also looking to use DeepSpeed + Ray to run inference with model parallelism.

Has the recommendation above changed? I would very much appreciate it if you could point me to the latest documentation for these concepts.

Thanks!

jamjambles avatar Apr 30 '24 08:04 jamjambles