
[REQUEST] Model serving via deepspeed's inference module

Open callzhang opened this issue 3 years ago • 9 comments

Is your feature request related to a problem? Please describe.
No.

Describe the solution you'd like
I am trying to serve my model in a model-parallel fashion. The tutorial shows how to run code on multiple GPUs, but the input data is predefined, so that approach does not work for serving. My original code uses FastAPI for serving. When I launch with deepspeed --num_gpus n example.py, the FastAPI server is also started n times (once per process), which causes a port conflict.

Describe alternatives you've considered
Do I have to start the model in parallel with deepspeed in one script, start FastAPI in another script, and then connect the two somehow?

Additional context
None.

callzhang, Oct 31 '21

Hi @callzhang

Could you please provide a test script so that I can try it on my end and better understand the problem? Thanks, Reza

RezaYazdaniAminabadi, Nov 02 '21

Here is the minimal code I tried:

from fastapi import FastAPI, Request, Response, Query
from transformers import pipeline
import deepspeed, torch, os, uvicorn

app = FastAPI()

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-generation', model='gpt2', device=local_rank)
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.float,
                                           replace_method='auto')


@app.get("/gen")
def generate(text):
    return generator(text, max_length=100)

if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(f'initiating server on rank: {local_rank}')
    uvicorn.run(
        "min_example_deepspeed_mp:app", 
        host="0.0.0.0", port=8500, 
        log_level="info", 
        workers=1
    )

Then I ran deepspeed --num_gpus 2 min_example_deepspeed_mp.py and got the following error:

[2021-11-03 01:33:39,359] [WARNING] [runner.py:122:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-11-03 01:33:39,373] [INFO] [runner.py:360:main] cmd = /home/stardust/anaconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 nlp/sentence_generation/min_example_deepspeed_mp.py
[2021-11-03 01:33:39,993] [INFO] [launch.py:80:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2021-11-03 01:33:39,994] [INFO] [launch.py:86:main] nnodes=1, num_local_procs=2, node_rank=0
[2021-11-03 01:33:39,994] [INFO] [launch.py:101:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2021-11-03 01:33:39,994] [INFO] [launch.py:102:main] dist_world_size=2
[2021-11-03 01:33:39,994] [INFO] [launch.py:104:main] Setting CUDA_VISIBLE_DEVICES=0,1
initiating server on rank: 1
initiating server on rank: 0
initiating server on rank: 0
Traceback (most recent call last):
  File "nlp/sentence_generation/min_example_deepspeed_mp.py", line 17, in <module>
    uvicorn.run(
  File "/home/stardust/anaconda3/lib/python3.8/site-packages/uvicorn/main.py", line 447, in run
    server.run()
  File "/home/stardust/anaconda3/lib/python3.8/site-packages/uvicorn/server.py", line 68, in run
    return asyncio.run(self.serve(sockets=sockets))
  File "/home/stardust/anaconda3/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "uvloop/loop.pyx", line 1501, in uvloop.loop.Loop.run_until_complete
  File "/home/stardust/anaconda3/lib/python3.8/site-packages/uvicorn/server.py", line 76, in serve
    config.load()
  File "/home/stardust/anaconda3/lib/python3.8/site-packages/uvicorn/config.py", line 448, in load
    self.loaded_app = import_from_string(self.app)
  File "/home/stardust/anaconda3/lib/python3.8/site-packages/uvicorn/importer.py", line 21, in import_from_string
    module = importlib.import_module(module_str)
  File "/home/stardust/anaconda3/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/stardust/algorithms-playground/nlp/sentence_generation/min_example_deepspeed_mp.py", line 17, in <module>
    uvicorn.run(
  File "/home/stardust/anaconda3/lib/python3.8/site-packages/uvicorn/main.py", line 447, in run
    server.run()
  File "/home/stardust/anaconda3/lib/python3.8/site-packages/uvicorn/server.py", line 68, in run
    return asyncio.run(self.serve(sockets=sockets))
  File "/home/stardust/anaconda3/lib/python3.8/asyncio/runners.py", line 33, in run
    raise RuntimeError(
RuntimeError: asyncio.run() cannot be called from a running event loop
initiating server on rank: 1
Traceback (most recent call last):
  File "nlp/sentence_generation/min_example_deepspeed_mp.py", line 17, in <module>
    uvicorn.run(
  File "/home/stardust/anaconda3/lib/python3.8/site-packages/uvicorn/main.py", line 447, in run
    server.run()
  File "/home/stardust/anaconda3/lib/python3.8/site-packages/uvicorn/server.py", line 68, in run
    return asyncio.run(self.serve(sockets=sockets))
  File "/home/stardust/anaconda3/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "uvloop/loop.pyx", line 1501, in uvloop.loop.Loop.run_until_complete
  File "/home/stardust/anaconda3/lib/python3.8/site-packages/uvicorn/server.py", line 76, in serve
    config.load()
  File "/home/stardust/anaconda3/lib/python3.8/site-packages/uvicorn/config.py", line 448, in load
    self.loaded_app = import_from_string(self.app)
  File "/home/stardust/anaconda3/lib/python3.8/site-packages/uvicorn/importer.py", line 21, in import_from_string
    module = importlib.import_module(module_str)
  File "/home/stardust/anaconda3/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/stardust/algorithms-playground/nlp/sentence_generation/min_example_deepspeed_mp.py", line 17, in <module>
    uvicorn.run(
  File "/home/stardust/anaconda3/lib/python3.8/site-packages/uvicorn/main.py", line 447, in run
    server.run()
  File "/home/stardust/anaconda3/lib/python3.8/site-packages/uvicorn/server.py", line 68, in run
    return asyncio.run(self.serve(sockets=sockets))
  File "/home/stardust/anaconda3/lib/python3.8/asyncio/runners.py", line 33, in run
    raise RuntimeError(
RuntimeError: asyncio.run() cannot be called from a running event loop
sys:1: RuntimeWarning: coroutine 'Server.serve' was never awaited
sys:1: RuntimeWarning: coroutine 'Server.serve' was never awaited
Killing subprocess 8312
Killing subprocess 8313
Traceback (most recent call last):
  File "/home/stardust/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/stardust/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/stardust/anaconda3/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 171, in <module>
    main()
  File "/home/stardust/anaconda3/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 161, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/stardust/anaconda3/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 139, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/stardust/anaconda3/bin/python', '-u', 'nlp/sentence_generation/min_example_deepspeed_mp.py', '--local_rank=1']' returned non-zero exit status 1.

callzhang, Nov 02 '21
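The traceback suggests why the server fails even on a single rank: uvicorn was given the import string "min_example_deepspeed_mp:app", so config.load() imports the script as a module, which re-executes its top-level uvicorn.run(...) while the first server's event loop is already running (note that "initiating server on rank: 0" is printed twice). The stdlib-only sketch below reproduces just that nested asyncio.run() error; asyncio.sleep(0) is an arbitrary stand-in for uvicorn's Server.serve() coroutine, not anything uvicorn itself calls. The log also shows rank 1 starting a server, so the torch.distributed.is_initialized() guard evidently did not hold at that point in this setup.

```python
import asyncio


async def serve():
    # Stand-in for uvicorn's Server.serve(): while this coroutine runs,
    # an event loop is active in this thread.
    inner = asyncio.sleep(0)  # arbitrary coroutine for the nested call
    try:
        # What the re-imported script effectively does: call asyncio.run() again.
        asyncio.run(inner)
    except RuntimeError as exc:
        return str(exc)
    finally:
        inner.close()  # never awaited; close it to avoid a RuntimeWarning
    return "no error"


message = asyncio.run(serve())
print(message)  # asyncio.run() cannot be called from a running event loop
```

Two standard ways to avoid the re-import are passing the app object itself, as in uvicorn.run(app, host="0.0.0.0", port=8500), or wrapping the call in if __name__ == "__main__":, since uvicorn imports the file under its module name rather than as __main__. Either way, the ranks other than 0 still have to take part in every forward pass; the broadcast_object_list reply later in this thread is one way to arrange that.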

Thanks, I will try this on my end.

RezaYazdaniAminabadi, Nov 11 '21

Any update on this? Is there another recommended way to do this, for instance if we want to run with uvicorn and therefore can't use the deepspeed launcher?

david-rx, Jan 05 '22

Were you able to resolve it with gunicorn?

ForgetThatNight, Jul 11 '22

Any updates on this issue, @RezaYazdaniAminabadi?

gd1m3y, Dec 29 '22

Any update on this?

rahulvramesh, Mar 17 '23

Any update or workaround?

disbullief, Apr 13 '23

I'm facing the same problem in my project.

hanrui1sensetime, Apr 21 '23

Code:

import os

import torch
import torch.distributed as dist
import deepspeed
from transformers import pipeline

# The deepspeed launcher sets LOCAL_RANK and WORLD_SIZE for each process.
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '2'))

# Every rank loads the pipeline and its own shard of the model.
pipe = pipeline("text2text-generation", model="t5-v1_1-small", device=local_rank)

pipe.model = deepspeed.init_inference(
    pipe.model,
    mp_size=world_size,
    dtype=torch.float
)

pipe.device = torch.device(f'cuda:{local_rank}')


if not dist.is_initialized() or dist.get_rank() == 0:
    # Only rank 0 starts a web server, so there is no port conflict.
    from flask import Flask
    app = Flask(__name__)

    @app.route("/")
    def hello_world():
        object_list = ["Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy"]
        # Send the input to the other ranks so they run the same forward pass.
        dist.broadcast_object_list(object_list, src=0)
        output = pipe(object_list)
        print(output)
        return "<p>Hello, World!</p>%s" % output

    # Handle one request at a time so broadcasts from concurrent requests
    # cannot interleave (Flask's development server is threaded by default).
    app.run(threaded=False)

else:
    while True:
        # Block until rank 0 broadcasts the next input, then join the forward pass.
        object_list = [None]
        dist.broadcast_object_list(object_list, src=0)
        output = pipe(object_list)

Output:

root@000d2e0398b9:/workspace# curl http://127.0.0.1:5000
<p>Hello, World!</p>[{'generated_text': 'd review: this is the best cast iron skillet. Great review! Great review! Great'}]

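I used torch.distributed.broadcast_object_list to solve this problem: rank 0 runs the only web server and broadcasts each request's input to the other ranks, which loop waiting for it, so every rank runs the same forward pass.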

CNTRYROA, Jun 01 '23
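The broadcast_object_list approach in the last reply can be modeled without GPUs. The following is a hypothetical, stdlib-only sketch, not DeepSpeed or torch.distributed code: threads stand in for ranks, a per-worker queue.Queue stands in for dist.broadcast_object_list(src=0), and a threading.Barrier stands in for the collective communication inside a sharded forward pass, which is why ranks other than 0 cannot simply skip a request. LockstepGroup, handle_request, and the None shutdown sentinel are illustrative names, not part of any library.

```python
import queue
import threading


class LockstepGroup:
    """Hypothetical model of "rank 0 serves, other ranks follow".

    Threads play the ranks; queues play broadcast_object_list(src=0);
    a Barrier plays the collectives inside a model-parallel forward pass.
    """

    def __init__(self, world_size):
        self.inboxes = [queue.Queue() for _ in range(world_size - 1)]
        self.step_done = threading.Barrier(world_size)
        self.request_lock = threading.Lock()  # one broadcast at a time
        self.log_lock = threading.Lock()
        self.log = []  # (rank, prompt) pairs, for inspection
        self.workers = [
            threading.Thread(target=self._worker_loop, args=(rank, inbox))
            for rank, inbox in enumerate(self.inboxes, start=1)
        ]
        for worker in self.workers:
            worker.start()

    def _forward(self, rank, prompt):
        # Every rank must enter the same forward pass; the barrier blocks
        # until all ranks have joined, as a sharded model's collectives would.
        with self.log_lock:
            self.log.append((rank, prompt))
        self.step_done.wait()
        return prompt.upper()  # placeholder for the generated text

    def _worker_loop(self, rank, inbox):
        # Mirrors the `while True:` branch run by ranks other than 0.
        while True:
            prompt = inbox.get()  # blocks, like a receiving broadcast_object_list
            if prompt is None:    # shutdown sentinel (the reply's loop has none)
                return
            self._forward(rank, prompt)

    def handle_request(self, prompt):
        # Mirrors the Flask route on rank 0: broadcast, then run its own share.
        with self.request_lock:
            for inbox in self.inboxes:
                inbox.put(prompt)
            return self._forward(0, prompt)

    def shutdown(self):
        with self.request_lock:
            for inbox in self.inboxes:
                inbox.put(None)
        for worker in self.workers:
            worker.join()


group = LockstepGroup(world_size=3)
print(group.handle_request("is this review positive?"))  # IS THIS REVIEW POSITIVE?
group.shutdown()
```

In this model, if the worker loop were missing, rank 0 would block forever at the barrier, analogous to a model-parallel forward pass waiting on ranks that never join. The request lock serializes requests so two handlers cannot interleave their broadcasts, and the None sentinel gives the worker loop a clean exit, which the reply's while True loop lacks.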