[BUG] DeepSpeed Inference example in the tutorial got killed for no reason.
Describe the bug
I am running the GPT-Neo inference example from the tutorial at https://www.deepspeed.ai/tutorials/inference-tutorial/. This is my inference code:
import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

generator = pipeline('text-generation', model='EleutherAI/gpt-neo-125M',
                     device=local_rank)
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", min_length=50, num_return_sequences=1, max_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
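The log below warns that mp_size is deprecated in favor of tensor_parallel.tp_size; a minimal sketch of the same init_inference call with the newer parameter (assuming DeepSpeed 0.9.x's config API) would be:

generator.model = deepspeed.init_inference(generator.model,
                                           # replaces the deprecated mp_size argument
                                           tensor_parallel={'tp_size': world_size},
                                           replace_with_kernel_inject=True)

This only addresses the deprecation warning.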
I got no error message and the process was killed for no apparent reason:
Setting ds_accelerator to cuda (auto detect)
[2023-06-08 04:35:52,281] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-06-08 04:35:52,296] [INFO] [runner.py:555:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None ds_infer.py
Setting ds_accelerator to cuda (auto detect)
[2023-06-08 04:35:53,337] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.13.4-1+cuda11.7
[2023-06-08 04:35:53,337] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.13.4-1
[2023-06-08 04:35:53,337] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.13.4-1
[2023-06-08 04:35:53,337] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-06-08 04:35:53,337] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.13.4-1+cuda11.7
[2023-06-08 04:35:53,337] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-06-08 04:35:53,337] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.13.4-1
[2023-06-08 04:35:53,337] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-06-08 04:35:53,337] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-06-08 04:35:53,337] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-06-08 04:35:53,337] [INFO] [launch.py:163:main] dist_world_size=2
[2023-06-08 04:35:53,337] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
Setting ds_accelerator to cuda (auto detect)
Setting ds_accelerator to cuda (auto detect)
Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
[2023-06-08 04:35:57,463] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.9.3+4559aa9b, git-hash=4559aa9b, git-branch=HEAD
[2023-06-08 04:35:57,463] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-06-08 04:35:57,464] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-06-08 04:35:57,466] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-06-08 04:35:57,466] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-06-08 04:35:57,466] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
[2023-06-08 04:35:57,555] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.9.3+4559aa9b, git-hash=4559aa9b, git-branch=HEAD
[2023-06-08 04:35:57,555] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-06-08 04:35:57,555] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-06-08 04:35:57,558] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-06-08 04:35:57,558] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-06-08 04:35:57,569] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 768, 'intermediate_size': 3072, 'heads': 12, 'num_hidden_layers': -1, 'dtype': torch.float16, 'pre_layer_norm': True, 'norm_type': <NormType.LayerNorm: 1>, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 2, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 1024, 'min_out_tokens': 1, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False, 'set_empty_params': False, 'transposed_mode': False}
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
------------------------------------------------------
Free memory : 20.773438 (GigaBytes)
Total memory: 23.689514 (GigaBytes)
Requested memory: 0.087891 (GigaBytes)
Setting maximum total tokens (input + output) to 1024
WorkSpace: 0x7f2916000000
------------------------------------------------------
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[2023-06-08 04:35:59,354] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 2824
[2023-06-08 04:35:59,354] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 2825
[2023-06-08 04:35:59,355] [ERROR] [launch.py:320:sigkill_handler] ['/usr/bin/python', '-u', 'ds_infer.py', '--local_rank=1'] exits with return code = -7
I have double-checked that the process was not killed due to OOM.
To Reproduce
deepspeed --num_gpus 2 ds_infer.py
ds_report output
Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.9.3+4559aa9b, 4559aa9b, HEAD
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
System info (please complete the following information):
- Ubuntu 20.04 LTS (in Docker)
- 2 x RTX 3090 24G
- transformers version: 4.31.0.dev0
- torch version: 1.13.1
- Python version: 3.8
- Any other relevant info about your setup:
Docker context
I am using the Dockerfile from deepspeed's repo https://github.com/microsoft/DeepSpeed/blob/master/docker/Dockerfile with the base image changed to nvidia/cuda:11.7.0-devel-ubuntu20.04.
Same bug
Same bug
Same bug
same bug
Maybe try adding shm_size.
Refer to https://stackoverflow.com/questions/30210362/how-to-increase-the-size-of-the-dev-shm-in-docker-container, or modify shm_size for an existing container:
- stop docker: sudo systemctl stop docker
- cd to the container path and edit hostconfig.json:
sudo -s
cd /var/lib/docker/containers/YOUR_CONTAINER_ID
vim hostconfig.json
- change "ShmSize": 67108864 to "ShmSize": 4294967296 (or another larger number that you like; I chose 4G)
- restart the docker service: sudo systemctl restart docker
ref https://discuss.pytorch.org/t/training-crashes-due-to-insufficient-shared-memory-shm-nn-dataparallel/26396/2
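If you can recreate the container instead, the shared memory size can also be set at launch time rather than by editing hostconfig.json. A minimal sketch (the image name is a placeholder; --gpus assumes the NVIDIA container toolkit is installed):

docker run --gpus all --shm-size=4g -it YOUR_IMAGE /bin/bash

The --shm-size flag sets the size of /dev/shm inside the container, which NCCL uses for communication between ranks on the same node.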
It does not seem to be an shm issue; my shm is about 100 GB, but I still get the error when training.
Interesting! I am having the same bug with the 2-GPU setting; the 1-GPU setting works fine for me.
Setting the shm_size to 128GB works for me.
Thanks!