DeepSpeedExamples
Inference latency is higher with 4 GPUs than with 1 GPU
I am trying DeepSpeed inference with the GPT-Neo 1.3B model. I am using the example here for reference.
# Filename: example.py
import os
import deepspeed
import datetime
import torch
from transformers import pipeline
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-generation',
                     model='EleutherAI/gpt-neo-1.3B',
                     device=local_rank)
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.float,
                                           replace_method='auto')
# from parallelformers import parallelize
# parallelize(generator.model, num_gpus=2, fp16=True, verbose='detail')
start = datetime.datetime.now()
string = generator("DeepSpeed is", do_sample=True, min_length=50)
end = datetime.datetime.now()
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
    print("Time for dp inference", (end - start).total_seconds() * 1000)
deepspeed --num_gpus 4 example.py
Time for dp inference 1457.596
deepspeed --num_gpus 1 example.py
Time for dp inference 666.149
The inference latency does not make sense to me: it is higher with 4 GPUs (about 1458 ms) than with 1 GPU (about 666 ms).
From the docs, I see that this model supports multi-GPU inference with inter-GPU communication: https://www.deepspeed.ai/tutorials/inference-tutorial/#end-to-end-gpt-neo-27b-inference
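The numbers above come from a single timed call, so they may include one-time costs such as CUDA initialization or kernel loading, and sampled generation can also produce a different number of tokens on each run. Whether any of that explains the gap is an assumption, not something confirmed here. The sketch below is not part of the original script: `time_call` is a hypothetical helper that runs a few untimed warm-up calls, then times several repetitions, optionally calling a synchronization function (for example `torch.cuda.synchronize`) before each clock read so that queued GPU work is counted.

```python
import statistics
import time


def time_call(fn, warmup=2, iters=5, sync=None):
    """Return per-call latencies in milliseconds for fn().

    sync is an optional zero-argument callable (for example
    torch.cuda.synchronize) run before each clock read so that
    queued asynchronous GPU work is included in the measurement.
    """
    for _ in range(warmup):
        fn()  # untimed warm-up calls absorb one-time setup costs
    latencies_ms = []
    for _ in range(iters):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return latencies_ms


if __name__ == "__main__":
    # Stand-in workload; in example.py this would be something like
    # lambda: generator("DeepSpeed is", do_sample=True, min_length=50)
    samples = time_call(lambda: sum(range(100_000)))
    print("median ms:", statistics.median(samples))
```

Comparing the median over several warmed-up calls for `--num_gpus 1` and `--num_gpus 4` would show whether the difference persists once first-call overhead is excluded.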
Environment: AWS p3.8xlarge instance.
NVIDIA-SMI 450.142.00
Driver Version: 450.142.00
CUDA Version: 11.0
deepspeed 0.5.3
mpi4py 3.1.1
ninja 1.10.2.1
transformers 4.11.2