DeepSpeed-MII
waiting for server to start...
Hello, I started a deployment on one node with 4 GPUs and set tensor_parallel to 2. The program is always waiting for server to start.
My code is:
My hostfile is: 127.0.0.1 slots=2
@yunll are you able to see any GPU memory usage (via nvidia-smi)? I am wondering if there is a problem loading the model. Either way, I think we could improve the feedback to the user to be more descriptive about what the server is doing in the background.
Also, could you try without the gRPC server? Set deployment_type=mii.DeploymentType.NON_PERSISTENT in your call to mii.deploy() and launch with deepspeed --num_gpus 2 your_script.py.
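For reference, a minimal non-persistent run might look like the sketch below; the model name, deployment name, and prompts are placeholders, and it assumes the legacy mii.deploy / mii.mii_query_handle API used elsewhere in this thread:

import mii

# Sketch only: model, deployment name, and prompts are placeholders.
mii.deploy(task="text-generation",
           model="bigscience/bloom-560m",
           deployment_name="bloom_deployment",
           deployment_type=mii.DeploymentType.NON_PERSISTENT)

# With NON_PERSISTENT there is no standalone gRPC server;
# the query runs in-process on each rank.
generator = mii.mii_query_handle("bloom_deployment")
result = generator.query({"query": ["DeepSpeed is", "Seattle is"]},
                         max_new_tokens=30)
print(result)

Launching it with deepspeed --num_gpus 2 sketch.py keeps everything in the foreground, so a hang during model loading shows up directly in the console instead of behind waiting for server to start.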
I'm facing the same issue with both persistent and non-persistent deployments. It's not loading the model on the GPUs. I've tried deepspeed and zero2 and zero3.
import mii

model_id = "codellama/CodeLlama-7b-Instruct-hf"
model_path = ".cache/huggingface/hub/"

mii_configs = {"tensor_parallel": 5,
               "dtype": "fp16",
               "trust_remote_code": True}

mii.deploy(task="text-generation",
           model=model_id,
           deployment_name="mii",
           model_path=model_path + model_id,
           mii_config=mii_configs,
           enable_deepspeed=True,
           enable_zero=False,
           deployment_type=mii.constants.DeploymentType.NON_PERSISTENT)
[2023-09-04 12:23:19,159] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['.local/lib/python3.10/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['.local/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.10.2, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
shared memory (/dev/shm) size .... 28.73 GB
@infosechoudini what behavior are you seeing when you load the model with non-persistent deployment type or just using DeepSpeed? Does a simple script like the following run for you?
import torch
import deepspeed
import os
from transformers import pipeline

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

task_name = "text-generation"
model_name = "gpt2"
input_strs = ["DeepSpeed is", "Microsoft is"]

def run():
    pipe = pipeline(task_name, model_name, torch_dtype=torch.float16, device=local_rank)
    pipe.model = deepspeed.init_inference(
        pipe.model,
        replace_with_kernel_inject=True,
        mp_size=world_size,
        dtype=torch.float16,
    )
    output = pipe(input_strs)
    print(output)

if __name__ == "__main__":
    run()
Run it with: deepspeed script.py
Hey,
DeepSpeed works fine. I just finished training a model with DeepSpeed yesterday. I was messing around with it but couldn't find a solution.
It just hangs on waiting for server to start and then crashes after it times out.
@infosechoudini I want to determine whether there is a bug in MII or a problem in your environment that is causing this hang. I see that you are setting "tensor_parallel": 5. I have seen issues in the past with model sharding when using an odd number of GPUs. Could you try running with 4 GPUs?
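Concretely, that change is small; a sketch of the adjusted config, assuming the rest of the deployment call from above stays the same:

mii_configs = {"tensor_parallel": 4,  # even GPU count; odd-way sharding has caused issues
               "dtype": "fp16",
               "trust_remote_code": True}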
Hi @mrwyattii, may I ask how to keep the RESTful server alive? Here is my script:
import mii

mii_configs = {
    "tensor_parallel": 2,
    "dtype": "fp16",
    "enable_restful_api": True,
    "restful_api_port": 35215,
    "skip_model_check": True
}

mii.deploy(task="text-generation",
           model="/path/to/my/model",
           deployment_name="MY_DEPLOYMENT",
           mii_config=mii_configs,
           deployment_type=mii.DeploymentType.NON_PERSISTENT)
It seems that after I ran deepspeed --num_gpus 2 api.py, the process just exited. The model was loaded onto the GPUs, but the server did not stay alive. Can you help me out?
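If the intent is a long-lived RESTful server, one likely culprit is the NON_PERSISTENT deployment type, which runs in-process and exits with the script. A minimal sketch of a persistent variant, assuming the legacy API's LOCAL deployment type and a plain python launch:

import mii

mii_configs = {
    "tensor_parallel": 2,
    "dtype": "fp16",
    "enable_restful_api": True,
    "restful_api_port": 35215,
    "skip_model_check": True
}

# Assumption: DeploymentType.LOCAL starts a persistent gRPC server
# (plus the RESTful gateway) that outlives this script.
mii.deploy(task="text-generation",
           model="/path/to/my/model",
           deployment_name="MY_DEPLOYMENT",
           mii_config=mii_configs,
           deployment_type=mii.DeploymentType.LOCAL)

# Launch with: python api.py (not the deepspeed launcher);
# shut down later with mii.terminate("MY_DEPLOYMENT").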
Hi @mrwyattii, what could be the potential reasons for server.py to keep waiting for server to start? When I ran the test.py you gave, it seemed to work, so I don't think the model itself is the cause. But when I run server.py, it waits forever for the server to come up. nvidia-smi shows only 448 MB used on each GPU while I try to load the 7B model, so I guess the model is not being loaded properly. Why is this different between persistent and non-persistent deployment?