[BUG] DeepSpeed Inference example in the tutorial got killed for no reason.
Describe the bug
I am running the GPT-Neo inference example from the tutorial at https://www.deepspeed.ai/tutorials/inference-tutorial/. This is my inference code:
import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

generator = pipeline('text-generation', model='EleutherAI/gpt-neo-125M',
                     device=local_rank)
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", min_length=50, num_return_sequences=1, max_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
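The log below warns that mp_size is deprecated in favor of tensor_parallel.tp_size; a minimal sketch of the same init_inference call with the newer parameter (assuming DeepSpeed 0.9.x's config API) would be:

generator.model = deepspeed.init_inference(generator.model,
                                           # replaces the deprecated mp_size argument
                                           tensor_parallel={'tp_size': world_size},
                                           replace_with_kernel_inject=True)

This only addresses the deprecation warning.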
I got no error message and the process was killed for no apparent reason:
Setting ds_accelerator to cuda (auto detect)
[2023-06-08 04:35:52,281] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-06-08 04:35:52,296] [INFO] [runner.py:555:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None ds_infer.py
Setting ds_accelerator to cuda (auto detect)
[2023-06-08 04:35:53,337] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.13.4-1+cuda11.7
[2023-06-08 04:35:53,337] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.13.4-1
[2023-06-08 04:35:53,337] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.13.4-1
[2023-06-08 04:35:53,337] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-06-08 04:35:53,337] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.13.4-1+cuda11.7
[2023-06-08 04:35:53,337] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-06-08 04:35:53,337] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.13.4-1
[2023-06-08 04:35:53,337] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-06-08 04:35:53,337] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-06-08 04:35:53,337] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-06-08 04:35:53,337] [INFO] [launch.py:163:main] dist_world_size=2
[2023-06-08 04:35:53,337] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
Setting ds_accelerator to cuda (auto detect)
Setting ds_accelerator to cuda (auto detect)
Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
[2023-06-08 04:35:57,463] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.9.3+4559aa9b, git-hash=4559aa9b, git-branch=HEAD
[2023-06-08 04:35:57,463] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-06-08 04:35:57,464] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-06-08 04:35:57,466] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-06-08 04:35:57,466] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-06-08 04:35:57,466] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
[2023-06-08 04:35:57,555] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.9.3+4559aa9b, git-hash=4559aa9b, git-branch=HEAD
[2023-06-08 04:35:57,555] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-06-08 04:35:57,555] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-06-08 04:35:57,558] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-06-08 04:35:57,558] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-06-08 04:35:57,569] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 768, 'intermediate_size': 3072, 'heads': 12, 'num_hidden_layers': -1, 'dtype': torch.float16, 'pre_layer_norm': True, 'norm_type': <NormType.LayerNorm: 1>, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 2, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 1024, 'min_out_tokens': 1, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False, 'set_empty_params': False, 'transposed_mode': False}
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
------------------------------------------------------
Free memory : 20.773438 (GigaBytes)
Total memory: 23.689514 (GigaBytes)
Requested memory: 0.087891 (GigaBytes)
Setting maximum total tokens (input + output) to 1024
WorkSpace: 0x7f2916000000
------------------------------------------------------
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[2023-06-08 04:35:59,354] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 2824
[2023-06-08 04:35:59,354] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 2825
[2023-06-08 04:35:59,355] [ERROR] [launch.py:320:sigkill_handler] ['/usr/bin/python', '-u', 'ds_infer.py', '--local_rank=1'] exits with return code = -7
I have double-checked that the process was not killed due to OOM.
To Reproduce
deepspeed --num_gpus 2 ds_infer.py
ds_report output
Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.9.3+4559aa9b, 4559aa9b, HEAD
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
System info (please complete the following information):
- Ubuntu 20.04 LTS (in Docker)
- 2 x RTX 3090 24G
- transformers version: 4.31.0.dev0
- torch version: 1.13.1
- Python version: 3.8
- Any other relevant info about your setup:
Docker context
I am using the Dockerfile from deepspeed's repo https://github.com/microsoft/DeepSpeed/blob/master/docker/Dockerfile with the base image changed to nvidia/cuda:11.7.0-devel-ubuntu20.04.
Same bug
Same bug
Same bug
same bug
Maybe try adding shm_size.
Refer to https://stackoverflow.com/questions/30210362/how-to-increase-the-size-of-the-dev-shm-in-docker-container, or modify shm_size for an existing container:
- stop docker: sudo systemctl stop docker
- cd to the container path and edit hostconfig.json:
sudo -s
cd /var/lib/docker/containers/YOUR_CONTAINER_ID
vim hostconfig.json
- change "ShmSize": 67108864 to "ShmSize": 4294967296 (or another larger number that you like; I chose 4G)
- restart the docker service: sudo systemctl restart docker
ref https://discuss.pytorch.org/t/training-crashes-due-to-insufficient-shared-memory-shm-nn-dataparallel/26396/2
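If you can recreate the container instead, the shared memory size can also be set at launch time rather than by editing hostconfig.json. A minimal sketch (the image name is a placeholder; --gpus assumes the NVIDIA container toolkit is installed):

docker run --gpus all --shm-size=4g -it YOUR_IMAGE /bin/bash

The --shm-size flag sets the size of /dev/shm inside the container, which NCCL uses for communication between ranks on the same node.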
It does not seem to be an shm issue; my shm is about 100 GB, but I still get the error when training.
Interesting! I am having the same bug with the 2-GPU setting; the 1-GPU setting works fine for me.
Setting the shm_size to 128GB works for me.
Thanks!