Multi node or remote machine inference doesn't work without "--force_multi" parameter

Open sarathkondeti opened this issue 2 years ago • 9 comments

I'm trying to figure out why my script doesn't work without the "--force_multi" param in ds_launch_str: https://github.com/microsoft/DeepSpeed-MII/blob/0fe4eb86b93e8210736f3e8c671bc886af64fd67/mii/server.py#L116

Expected parallelism: replicas 4 and TP 1

Hardware: 2 machines with 2x Tesla T4s each

hostfile:

localhost slots=2
strange slots=2

my script:

import mii
mii_configs = {
    "tensor_parallel": 1,
    "dtype": "fp16",
    "replace_with_kernel_inject": False,
    "load_with_sys_mem": True,
    "replica_num": 4,
    "hostfile": "/home/anurag_dutt/vicuna_parallelism/hostfile",
}
deployment = "vicuna_deployment"
mii.deploy(task='text-generation',
           model="lmsys/vicuna-7b-v1.5",
           model_path="/home/anurag_dutt/.cache/huggingface/hub",
           deployment_name=deployment,
           deployment_type=mii.DeploymentType.LOCAL,
           mii_config=mii_configs,
           )

I've also tried running a simple DeepSpeed script (TP 1) on the remote machine ('strange') using the hostfile and run commands below.

Hostfile:

strange slots=1

run command: (doesn't work) deepspeed --hostfile ~/vicuna_parallelism/hostfile tp.py

run command: (works) deepspeed --force_multi --hostfile ~/vicuna_parallelism/hostfile tp.py

sarathkondeti avatar Oct 02 '23 07:10 sarathkondeti

Hi @sarathkondeti thanks for reporting this bug. I will try to replicate this on a local system. Just to clarify, are you able to run the MII deployment if you add --force_multi to the ds_launch_str?

mrwyattii avatar Oct 05 '23 21:10 mrwyattii

Yes, I rebuilt MII after adding that parameter, and it works for me now.

sarathkondeti avatar Oct 05 '23 23:10 sarathkondeti

Thank you. I will work on a fix for this and it will be in the next MII release.

mrwyattii avatar Oct 09 '23 21:10 mrwyattii

+1, same bug. I set up 12 V100 nodes.

AskMrYogi avatar Oct 09 '23 21:10 AskMrYogi

@sarathkondeti and @AskMrYogi after taking a closer look at the DeepSpeed launcher code: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/launcher/runner.py#L489

I think the reason you must add --force_multi is that your hostfile is not being parsed correctly. Can you share what your hostfile looks like (or let me know if you are not providing one to MII)? Thank you!
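
For anyone who wants to sanity-check a hostfile before posting it, here is a minimal sketch of a standalone checker. It is not DeepSpeed's own parser: the `parse_hostfile` helper and its regex are hypothetical and only assume the one-entry-per-line `hostname slots=N` format used in this thread.

```python
import re

# Hypothetical helper, NOT part of DeepSpeed or MII.
# Assumes one "hostname slots=N" entry per line, as in the hostfiles in this thread.
_LINE = re.compile(r"^(\S+)\s+slots=(\d+)$")


def parse_hostfile(text):
    """Return {hostname: slots}; raise ValueError on a malformed or duplicate line."""
    hosts = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        match = _LINE.match(line)
        if match is None:
            raise ValueError(
                f"line {lineno}: expected 'hostname slots=N', got {line!r}"
            )
        host, slots = match.group(1), int(match.group(2))
        if host in hosts:
            raise ValueError(f"line {lineno}: duplicate host {host!r}")
        hosts[host] = slots
    return hosts


hosts = parse_hostfile("localhost slots=2\nstrange slots=2\n")
print(hosts, "total slots:", sum(hosts.values()))
```

If this helper rejects a file, that only means the file does not match the assumed format; it says nothing definitive about how the DeepSpeed launcher itself would treat it.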

mrwyattii avatar Oct 10 '23 16:10 mrwyattii

hostfile looks like this

hostname1 slots=2
hostname2 slots=2

AskMrYogi avatar Oct 11 '23 16:10 AskMrYogi

Ok that looks correct, so there must be a bug somewhere. I will try to replicate on my side and find a fix. Thank you!

mrwyattii avatar Oct 11 '23 17:10 mrwyattii

Maybe because of this:

#worker_str = f"-H {hostfile} "
worker_str = ""

in server.py, lines 175-176

Anditty avatar Nov 07 '23 02:11 Anditty

Use worker_str = f"-H {hostfile} " instead of worker_str = "".
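
To make the difference concrete, here is a rough sketch of how a launch string changes when the hostfile option is passed through. The `build_launch_cmd` helper is hypothetical and is not MII's actual code in server.py; it only uses the `-H` option and the `--force_multi` flag as they are written earlier in this thread.

```python
import shlex


def build_launch_cmd(script, hostfile=None, force_multi=False):
    """Hypothetical helper (not MII's code): assemble a deepspeed command line."""
    parts = ["deepspeed"]
    if hostfile:
        # Mirrors the commented-out worker_str = f"-H {hostfile} " line
        parts += ["-H", hostfile]
    if force_multi:
        # Workaround reported at the top of this issue
        parts.append("--force_multi")
    parts.append(script)
    return shlex.join(parts)


print(build_launch_cmd("tp.py"))
print(build_launch_cmd("tp.py", hostfile="hostfile"))
print(build_launch_cmd("tp.py", hostfile="hostfile", force_multi=True))
```

With an empty worker_str, the first form is what gets built, so the launcher never receives the hostfile at all; whether that fully explains the need for --force_multi is still unconfirmed.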

Anditty avatar Nov 07 '23 02:11 Anditty