DeepSpeedExamples Predict latency is more with 4 GPUs than 1 GPU

Open jagadeeshi2i opened this issue 2 years ago • 1 comment

I am trying DeepSpeed inference with the GPT-Neo 1.3B model, using the example here for reference.

# Filename: example.py
import os
import deepspeed
import datetime
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-generation',
                     model='EleutherAI/gpt-neo-1.3B',
                     device=local_rank)

generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.float,
                                           replace_method='auto')

# from parallelformers import parallelize
# parallelize(generator.model, num_gpus=2, fp16=True, verbose='detail')

start = datetime.datetime.now()
string = generator("DeepSpeed is", do_sample=True, min_length=50)
end = datetime.datetime.now()
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
    print("Time for dp inference", (end - start).total_seconds() * 1000)

deepspeed --num_gpus 4 example.py
Time for dp inference 1457.596

deepspeed --num_gpus 1 example.py
Time for dp inference 666.149

The inference latency does not make sense to me: it is higher with 4 GPUs (about 1458 ms) than with 1 GPU (about 666 ms).
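Note that each number above comes from a single timed call, so it may include one-time startup cost in addition to steady-state latency. A minimal sketch of a timing helper that discards warm-up calls and optionally synchronizes before reading the clock is shown here. The helper name `measure_latency_ms` and its parameters are hypothetical and not part of DeepSpeed or transformers, and the stand-in workload only exists so the sketch runs without a GPU.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=2, iters=5, sync=None):
    """Return per-call latencies in milliseconds, excluding warm-up calls.

    fn   -- zero-argument callable to time (hypothetical placeholder)
    sync -- optional zero-argument callable invoked before each clock read,
            for example torch.cuda.synchronize on a GPU machine
    """
    def _sync():
        if sync is not None:
            sync()

    # Warm-up calls are executed but not timed.
    for _ in range(warmup):
        fn()

    timings = []
    for _ in range(iters):
        _sync()  # make sure earlier queued work has finished
        start = time.perf_counter()
        fn()
        _sync()  # make sure the timed call has finished
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


if __name__ == "__main__":
    # Stand-in CPU workload; replace with the real inference call.
    timings = measure_latency_ms(lambda: sum(range(100_000)))
    print("median ms:", statistics.median(timings))
```

In example.py this could be used, under the same assumptions, as `measure_latency_ms(lambda: generator("DeepSpeed is", do_sample=True, min_length=50), sync=torch.cuda.synchronize)`, comparing the median for 1 GPU and 4 GPUs.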

From the docs, I see that this model supports multi-GPU inference with inter-GPU communication: https://www.deepspeed.ai/tutorials/inference-tutorial/#end-to-end-gpt-neo-27b-inference

Environment: AWS p3.8xlarge instance.
NVIDIA-SMI 450.142.00   
Driver Version: 450.142.00   
CUDA Version: 11.0
deepspeed                     0.5.3
mpi4py                        3.1.1
ninja                         1.10.2.1
transformers                  4.11.2

Attached logs: 4_gpu.log, 1_gpu.log

Oct 05 '21 13:10 jagadeeshi2i
