DeepSpeed [BUG] Inference failed several times

Describe the bug

When I run my inference code with deepspeed.init_inference(), it only works some of the time. With num_gpus=2 it sometimes fails, and with num_gpus>2 it always fails. I followed this tutorial: https://www.deepspeed.ai/tutorials/inference-tutorial/

To Reproduce

Steps to reproduce the behavior:

0. Installation

conda create -n py38 -y python=3.8
conda activate py38 
conda install -c "nvidia/label/cuda-11.7.0" cuda-toolkit   # default installed nvcc in my machine is 10.1
pip install torch==1.13.0+cu117 torchvision==0.14.0+cu117 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu117
pip install ninja 

git clone https://github.com/microsoft/DeepSpeed/
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_TRANSFORMER_INFERENCE=1 pip install . \
--global-option="build_ext" --global-option="-j8" --no-cache -v \
--disable-pip-version-check 2>&1 | tee build.log

My inference script

# run.py 

import deepspeed
import torch 
import os 
from transformers import AutoModel

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

model = AutoModel.from_pretrained("vinai/phobert-base").eval()
model.to(local_rank)

# Initialize the DeepSpeed-Inference engine
ds_engine = deepspeed.init_inference(model,
                                 mp_size=world_size,
                                 checkpoint=None,
                                 dtype=torch.float)

model = ds_engine.module

input_ids = torch.LongTensor([[0, 1, 2, 3]]).to(local_rank)

with torch.no_grad():
    output = model(input_ids=input_ids)

print(output.last_hidden_state)
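
The logs further down warn that mp_size is deprecated in favour of tensor_parallel.tp_size. Here is a minimal sketch of how the keyword arguments could be migrated, assuming deepspeed.init_inference accepts a tensor_parallel dict as that warning suggests. The helper name migrate_inference_kwargs is hypothetical and not part of DeepSpeed.

```python
def migrate_inference_kwargs(kwargs: dict) -> dict:
    """Return a copy of kwargs with legacy mp_size moved to tensor_parallel.tp_size.

    Hypothetical helper for illustration only; it is not a DeepSpeed API.
    """
    new_kwargs = dict(kwargs)
    if "mp_size" in new_kwargs:
        tp_size = new_kwargs.pop("mp_size")
        # Keep any tensor_parallel settings the caller already supplied.
        tensor_parallel = dict(new_kwargs.get("tensor_parallel", {}))
        tensor_parallel.setdefault("tp_size", tp_size)
        new_kwargs["tensor_parallel"] = tensor_parallel
    return new_kwargs


legacy = {"mp_size": 2, "checkpoint": None}
print(migrate_inference_kwargs(legacy))
# Assumed usage in run.py (not executed here):
# deepspeed.init_inference(model, dtype=torch.float, **migrate_inference_kwargs(legacy))
```

This only silences the deprecation warning. It is not expected to change the failure reported here.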

How to run the script

deepspeed --num_gpus=2 run.py

Expected behavior

Output of a successful run:

(py38) root@quangthd-7c6fb44f48-qfdg7:/home/workspace/train_exp# deepspeed --num_gpus=2 run_1.py
[2023-04-11 02:37:20,770] [WARNING] [runner.py:181:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-11 02:37:20,864] [INFO] [runner.py:527:main] cmd = /root/miniconda3/envs/py38/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None run_1.py
[2023-04-11 02:37:23,948] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.13.4-1+cuda11.7
[2023-04-11 02:37:23,948] [INFO] [launch.py:126:main] 0 NCCL_VERSION=2.13.4-1
[2023-04-11 02:37:23,948] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.13.4-1
[2023-04-11 02:37:23,949] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.13.4-1+cuda11.7
[2023-04-11 02:37:23,949] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-04-11 02:37:23,949] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-04-11 02:37:23,949] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.13.4-1
[2023-04-11 02:37:23,949] [INFO] [launch.py:133:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-04-11 02:37:23,949] [INFO] [launch.py:139:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-04-11 02:37:23,949] [INFO] [launch.py:150:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-04-11 02:37:23,949] [INFO] [launch.py:151:main] dist_world_size=2
[2023-04-11 02:37:23,949] [INFO] [launch.py:153:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-04-11 02:37:35,509] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.8.3+4d27225f, git-hash=4d27225f, git-branch=master
[2023-04-11 02:37:35,515] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-04-11 02:37:35,516] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-04-11 02:37:35,520] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-04-11 02:37:35,728] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.8.3+4d27225f, git-hash=4d27225f, git-branch=master
[2023-04-11 02:37:35,730] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-04-11 02:37:35,731] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
AutoTP:  [(<class 'transformers.models.roberta.modeling_roberta.RobertaLayer'>, ['output.dense'])]
AutoTP:  [(<class 'transformers.models.roberta.modeling_roberta.RobertaLayer'>, ['output.dense'])]
tensor([[[-0.1506,  0.1654, -0.2740,  ..., -0.0372,  0.0528, -0.4088],
         [-0.0466,  0.0702,  0.1521,  ..., -0.3913,  0.2049, -0.2464],
         [-0.2765, -0.1203, -0.7273,  ..., -0.2264, -0.0797, -0.4175],
         [-0.2209, -0.2705, -0.7297,  ..., -0.3494,  0.2433, -0.7732]]],
       device='cuda:1')
tensor([[[-0.1506,  0.1654, -0.2740,  ..., -0.0372,  0.0528, -0.4088],
         [-0.0466,  0.0702,  0.1521,  ..., -0.3913,  0.2049, -0.2464],
         [-0.2765, -0.1203, -0.7273,  ..., -0.2264, -0.0797, -0.4175],
         [-0.2209, -0.2705, -0.7297,  ..., -0.3494,  0.2433, -0.7732]]],
       device='cuda:0')
[2023-04-11 02:37:39,011] [INFO] [launch.py:329:main] Process 12518 exits successfully.
[2023-04-11 02:37:39,012] [INFO] [launch.py:329:main] Process 12517 exits successfully.
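
For anyone reading the launcher command in the log above: the --world_info value appears to be base64-encoded JSON describing which GPUs each host uses. A small sketch that decodes the string from this log, assuming URL-safe base64 as the encoding:

```python
import base64
import json


def decode_world_info(encoded: str) -> dict:
    """Decode a --world_info argument into a {hostname: [gpu ids]} dict."""
    raw = base64.urlsafe_b64decode(encoded)
    return json.loads(raw)


# Value copied from the successful 2-GPU run in the log.
print(decode_world_info("eyJsb2NhbGhvc3QiOiBbMCwgMV19"))
# prints {'localhost': [0, 1]}
```

The decoded value matches the "WORLD INFO DICT" line the launcher prints a few lines later.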

To trigger the bug, run the script several times or increase --num_gpus. Here is the log of a failed run:

(py38) root@quangthd-7c6fb44f48-qfdg7:/home/workspace/train_exp# deepspeed --num_gpus=3 run_1.py
[2023-04-11 02:38:38,952] [WARNING] [runner.py:181:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-11 02:38:39,050] [INFO] [runner.py:527:main] cmd = /root/miniconda3/envs/py38/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMl19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None run_1.py
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.13.4-1+cuda11.7
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NCCL_VERSION=2.13.4-1
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.13.4-1
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.13.4-1+cuda11.7
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.13.4-1
[2023-04-11 02:38:41,916] [INFO] [launch.py:133:main] WORLD INFO DICT: {'localhost': [0, 1, 2]}
[2023-04-11 02:38:41,916] [INFO] [launch.py:139:main] nnodes=1, num_local_procs=3, node_rank=0
[2023-04-11 02:38:41,916] [INFO] [launch.py:150:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2]})
[2023-04-11 02:38:41,916] [INFO] [launch.py:151:main] dist_world_size=3
[2023-04-11 02:38:41,916] [INFO] [launch.py:153:main] Setting CUDA_VISIBLE_DEVICES=0,1,2
[2023-04-11 02:39:07,402] [INFO] [launch.py:297:sigkill_handler] Killing subprocess 13678
[2023-04-11 02:39:07,603] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.8.3+4d27225f, git-hash=4d27225f, git-branch=master
[2023-04-11 02:39:07,605] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-04-11 02:39:07,606] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-04-11 02:39:07,622] [INFO] [launch.py:297:sigkill_handler] Killing subprocess 13679
[2023-04-11 02:39:07,840] [INFO] [launch.py:297:sigkill_handler] Killing subprocess 13680
[2023-04-11 02:39:07,840] [ERROR] [launch.py:303:sigkill_handler] ['/root/miniconda3/envs/py38/bin/python', '-u', 'run_1.py', '--local_rank=2'] exits with return code = -9
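
A note on "return code = -9" in the last line: when Python reports a negative return code for a child process, the child was terminated by the signal with that number, so -9 corresponds to SIGKILL on Linux. A minimal sketch for translating such codes (the function name describe_return_code is only for illustration):

```python
import signal


def describe_return_code(code: int) -> str:
    """Turn a subprocess return code into a short human-readable description."""
    if code == 0:
        return "exited successfully"
    if code < 0:
        # Negative codes mean the process was terminated by signal number -code.
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = f"signal {-code}"
        return f"killed by {name}"
    return f"exited with error status {code}"


print(describe_return_code(-9))  # on Linux: killed by SIGKILL
```

The log alone does not show what sent the signal. It could be the launcher's own sigkill_handler cleaning up after another rank failed, or something external such as the kernel out-of-memory killer, so checking dmesg or host RAM usage may be worthwhile.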

ds_report

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/root/miniconda3/envs/py38/lib/python3.8/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/root/miniconda3/envs/py38/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.8.3+4d27225f, 4d27225f, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7

System info (please complete the following information):

OS: Ubuntu 18.04.6 LTS
GPU: 1 machine with 12 A16 GPUs
DeepSpeed version: 0.8.3
Transformers version: 4.27.4
Python version: 3.8.16

Apr 11 '23 02:04 Quang-elec44

+1 to stay in the loop. I suspect this happened because some DeepSpeed processes from your earlier runs were still alive, so a later run went out of memory (OOM).
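
One way to check this hypothesis is to list the processes that currently hold GPU memory before starting a new run. A rough sketch, assuming nvidia-smi is on the PATH. The parsing helper is hypothetical, and the query fields are the ones I believe nvidia-smi accepts for --query-compute-apps:

```python
import subprocess


def parse_compute_apps(csv_text: str) -> list:
    """Parse 'pid, process_name, used_memory' CSV rows into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        pid, used = parts[0], parts[-1]
        name = ",".join(parts[1:-1])
        # Memory may be reported as a non-numeric placeholder on some setups.
        used_mib = int(used) if used.isdigit() else None
        rows.append({"pid": int(pid), "name": name, "used_mib": used_mib})
    return rows


def list_gpu_processes() -> list:
    """Ask nvidia-smi which processes are using GPU memory right now."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    try:
        for proc in list_gpu_processes():
            print(proc)
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("nvidia-smi is not available on this machine")
```

If python processes from a previous deepspeed launch show up here, killing them before the next run would rule this cause in or out.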

Apr 11 '23 14:04 satpalsr

@satpalsr Not really! My first attempt with --num_gpus >3 failed without any previous run.

Apr 12 '23 05:04 Quang-elec44

I see that in your code the output is printed twice, once per GPU. Why is that? How can I run the inference only once per example?

Also, is DeepSpeed inference supposed to copy the model to all devices? On the tutorial page I see the following:

It supports model parallelism (MP) to fit large models that would otherwise not fit in GPU memory.

How does it help with this?

May 03 '23 20:05 udhavsethi

@udhavsethi According to the tutorial page, you can take the result from rank 0. As for model parallelism, in my experience it didn't work as I expected. It did split the model across GPUs, but the total memory used was higher than with a single GPU. The split was also uneven: the rank 0 GPU consumed more memory than the others.
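
To illustrate the rank 0 point: the launcher starts one process per GPU and every process runs the whole script, so every process reaches the print call. A minimal sketch that prints only from rank 0, assuming a single node where the LOCAL_RANK environment variable (already read in run.py) identifies the process. The helper names are hypothetical.

```python
import os


def is_rank_zero() -> bool:
    """True only in the process whose LOCAL_RANK is 0 (or when it is unset)."""
    return int(os.getenv("LOCAL_RANK", "0")) == 0


def print_rank0(*args, **kwargs) -> None:
    """Behave like print(), but stay silent on every rank except 0."""
    if is_rank_zero():
        print(*args, **kwargs)


print_rank0("only the rank-0 process prints this")
```

In run.py the last line would become print_rank0(output.last_hidden_state). The forward pass itself still has to run on every rank, since with tensor parallelism each process holds only part of the weights.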

May 04 '23 01:05 Quang-elec44

> I see that in your code the output is printed twice, once per GPU. Why is that? How can I run the inference only once per example?
>
> Also, is DeepSpeed inference supposed to copy the model to all devices? On the tutorial page I see the following:
>
> It supports model parallelism (MP) to fit large models that would otherwise not fit in GPU memory.
>
> How does it help with this?

Hi, I am also getting two outputs instead of one. Have you sorted out this issue?

Aug 15 '23 16:08 asifehmad
