DeepSpeed
[BUG] Inference failed several times
Describe the bug
When I run my inference code with deepspeed.init_inference(), it only works some of the time: with num_gpus=2 it sometimes fails, and with num_gpus>2 it always fails. I am following this tutorial: https://www.deepspeed.ai/tutorials/inference-tutorial/
To Reproduce
Steps to reproduce the behavior:
0. Installation
conda create -n py38 -y python=3.8
conda activate py38
conda install -c "nvidia/label/cuda-11.7.0" cuda-toolkit # default installed nvcc in my machine is 10.1
pip install torch==1.13.0+cu117 torchvision==0.14.0+cu117 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu117
pip install ninja
git clone https://github.com/microsoft/DeepSpeed/
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_TRANSFORMER_INFERENCE=1 pip install . \
--global-option="build_ext" --global-option="-j8" --no-cache -v \
--disable-pip-version-check 2>&1 | tee build.log
- My inference script
# run.py
import deepspeed
import torch
import os
from transformers import AutoModel
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
model = AutoModel.from_pretrained("vinai/phobert-base").eval()
model.to(local_rank)
# Initialize the DeepSpeed-Inference engine
ds_engine = deepspeed.init_inference(model,
                                     mp_size=world_size,
                                     checkpoint=None,
                                     dtype=torch.float)
model = ds_engine.module
input_ids = torch.LongTensor([[0, 1, 2, 3]]).to(local_rank)
with torch.no_grad():
    output = model(input_ids=input_ids)
print(output.last_hidden_state)
- How to run the script
deepspeed --num_gpus=2 run.py
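A side note on the script above: the logs in this report show the warning "Config parameter mp_size is deprecated use tensor_parallel.tp_size instead". A minimal sketch of the replacement argument follows, assuming the installed DeepSpeed version accepts a `tensor_parallel` keyword in `init_inference` as that warning suggests. The helper name `tensor_parallel_kwargs` is hypothetical and only builds the keyword arguments; this removes the warning and is not claimed to fix the crash.

```python
import os


def tensor_parallel_kwargs(world_size: int) -> dict:
    """Build keyword arguments that replace the deprecated mp_size.

    Hypothetical helper for illustration; it only constructs a dict.
    """
    return {"tensor_parallel": {"tp_size": world_size}}


# WORLD_SIZE is read the same way as in run.py.
world_size = int(os.getenv("WORLD_SIZE", "1"))
kwargs = tensor_parallel_kwargs(world_size)

# In run.py this would replace mp_size=world_size (assumed usage):
# ds_engine = deepspeed.init_inference(model, dtype=torch.float, **kwargs)
print(kwargs)
```

The DeepSpeed call is left as a comment so the sketch runs without a GPU or a DeepSpeed install.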
Expected behavior
Output of a successful run:
(py38) root@quangthd-7c6fb44f48-qfdg7:/home/workspace/train_exp# deepspeed --num_gpus=2 run_1.py
[2023-04-11 02:37:20,770] [WARNING] [runner.py:181:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-11 02:37:20,864] [INFO] [runner.py:527:main] cmd = /root/miniconda3/envs/py38/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None run_1.py
[2023-04-11 02:37:23,948] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.13.4-1+cuda11.7
[2023-04-11 02:37:23,948] [INFO] [launch.py:126:main] 0 NCCL_VERSION=2.13.4-1
[2023-04-11 02:37:23,948] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.13.4-1
[2023-04-11 02:37:23,949] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.13.4-1+cuda11.7
[2023-04-11 02:37:23,949] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-04-11 02:37:23,949] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-04-11 02:37:23,949] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.13.4-1
[2023-04-11 02:37:23,949] [INFO] [launch.py:133:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-04-11 02:37:23,949] [INFO] [launch.py:139:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-04-11 02:37:23,949] [INFO] [launch.py:150:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-04-11 02:37:23,949] [INFO] [launch.py:151:main] dist_world_size=2
[2023-04-11 02:37:23,949] [INFO] [launch.py:153:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-04-11 02:37:35,509] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.8.3+4d27225f, git-hash=4d27225f, git-branch=master
[2023-04-11 02:37:35,515] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-04-11 02:37:35,516] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-04-11 02:37:35,520] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-04-11 02:37:35,728] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.8.3+4d27225f, git-hash=4d27225f, git-branch=master
[2023-04-11 02:37:35,730] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-04-11 02:37:35,731] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
AutoTP: [(<class 'transformers.models.roberta.modeling_roberta.RobertaLayer'>, ['output.dense'])]
AutoTP: [(<class 'transformers.models.roberta.modeling_roberta.RobertaLayer'>, ['output.dense'])]
tensor([[[-0.1506, 0.1654, -0.2740, ..., -0.0372, 0.0528, -0.4088],
[-0.0466, 0.0702, 0.1521, ..., -0.3913, 0.2049, -0.2464],
[-0.2765, -0.1203, -0.7273, ..., -0.2264, -0.0797, -0.4175],
[-0.2209, -0.2705, -0.7297, ..., -0.3494, 0.2433, -0.7732]]],
device='cuda:1')
tensor([[[-0.1506, 0.1654, -0.2740, ..., -0.0372, 0.0528, -0.4088],
[-0.0466, 0.0702, 0.1521, ..., -0.3913, 0.2049, -0.2464],
[-0.2765, -0.1203, -0.7273, ..., -0.2264, -0.0797, -0.4175],
[-0.2209, -0.2705, -0.7297, ..., -0.3494, 0.2433, -0.7732]]],
device='cuda:0')
[2023-04-11 02:37:39,011] [INFO] [launch.py:329:main] Process 12518 exits successfully.
[2023-04-11 02:37:39,012] [INFO] [launch.py:329:main] Process 12517 exits successfully.
ds_report output
To trigger the bug, run the script several times or increase --num_gpus.
Here is the log of a failed run with --num_gpus=3:
(py38) root@quangthd-7c6fb44f48-qfdg7:/home/workspace/train_exp# deepspeed --num_gpus=3 run_1.py
[2023-04-11 02:38:38,952] [WARNING] [runner.py:181:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-11 02:38:39,050] [INFO] [runner.py:527:main] cmd = /root/miniconda3/envs/py38/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMl19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None run_1.py
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.13.4-1+cuda11.7
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NCCL_VERSION=2.13.4-1
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.13.4-1
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.13.4-1+cuda11.7
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.13.4-1
[2023-04-11 02:38:41,916] [INFO] [launch.py:133:main] WORLD INFO DICT: {'localhost': [0, 1, 2]}
[2023-04-11 02:38:41,916] [INFO] [launch.py:139:main] nnodes=1, num_local_procs=3, node_rank=0
[2023-04-11 02:38:41,916] [INFO] [launch.py:150:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2]})
[2023-04-11 02:38:41,916] [INFO] [launch.py:151:main] dist_world_size=3
[2023-04-11 02:38:41,916] [INFO] [launch.py:153:main] Setting CUDA_VISIBLE_DEVICES=0,1,2
[2023-04-11 02:39:07,402] [INFO] [launch.py:297:sigkill_handler] Killing subprocess 13678
[2023-04-11 02:39:07,603] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.8.3+4d27225f, git-hash=4d27225f, git-branch=master
[2023-04-11 02:39:07,605] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-04-11 02:39:07,606] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-04-11 02:39:07,622] [INFO] [launch.py:297:sigkill_handler] Killing subprocess 13679
[2023-04-11 02:39:07,840] [INFO] [launch.py:297:sigkill_handler] Killing subprocess 13680
[2023-04-11 02:39:07,840] [ERROR] [launch.py:303:sigkill_handler] ['/root/miniconda3/envs/py38/bin/python', '-u', 'run_1.py', '--local_rank=2'] exits with return code = -9
ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/root/miniconda3/envs/py38/lib/python3.8/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/root/miniconda3/envs/py38/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.8.3+4d27225f, 4d27225f, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
System info (please complete the following information):
- OS: Ubuntu 18.04.6 LTS
- GPU: 1 machine with 12 A16 GPUs
- Deepspeed version: 0.8.3
- Transformers version: 4.27.4
- Python version: 3.8.16
+1 to stay in the loop. I suspect this happened because some deepspeed processes kept running across your multiple runs and the machine then went OOM.
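For context on the OOM suspicion: the failing log ends with `exits with return code = -9`. Under Python's subprocess convention, a negative return code -N means the child process was terminated by signal N, so -9 corresponds to SIGKILL on Linux. An out-of-memory kill by the operating system is one common source of SIGKILL, but the log alone does not confirm that cause. A minimal sketch for decoding such codes (`describe_return_code` is a hypothetical helper, not part of DeepSpeed):

```python
import signal


def describe_return_code(code: int) -> str:
    """Translate a subprocess return code into a readable description."""
    if code < 0:
        # Negative values mean "terminated by signal -code".
        try:
            name = signal.Signals(-code).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {-code}"
        return f"killed by {name}"
    if code == 0:
        return "exited successfully"
    return f"exited with status {code}"


print(describe_return_code(-9))
```

If the kernel OOM killer was responsible, the system log (for example the output of `dmesg`) usually records it, which would be a way to check this hypothesis.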
@satpalsr Not really! My first attempt with --num_gpus >3 failed without any previous run.
I see that in your code the output is printed twice, once per GPU. Why is that? How to run the inference only once per example?
Also, is deepspeed inference supposed to copy over the model to all devices? On the tutorial page I see the following:
It supports model parallelism (MP) to fit large models that would otherwise not fit in GPU memory.
How is it helping with this?
@udhavsethi According to the tutorial page, you can take the result from rank 0 only. As for model parallelism, in my experience it did not work as I expected. It did split the model across GPUs, but the total memory used was higher than with a single GPU. The split was also uneven: the rank 0 GPU consumed more memory than the others.
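To illustrate the rank 0 suggestion: the launcher starts one process per GPU and each process runs the whole script, which is why the tensor is printed once per GPU. A minimal sketch of printing on a single process follows. It assumes the launcher exports a `RANK` environment variable alongside the `LOCAL_RANK` and `WORLD_SIZE` variables that run.py already reads; `is_main_rank` and `print_rank0` are hypothetical helper names.

```python
import os


def is_main_rank(env=os.environ) -> bool:
    """True on global rank 0, or when RANK is unset (single-process run)."""
    return int(env.get("RANK", "0")) == 0


def print_rank0(*args, env=os.environ, **kwargs) -> bool:
    """Print only on rank 0; return whether anything was printed."""
    if is_main_rank(env):
        print(*args, **kwargs)
        return True
    return False


# In run.py this would replace the plain print call (assumed usage):
# print_rank0(output.last_hidden_state)
print_rank0("only the rank 0 process prints this line")
```

Every rank still has to execute the forward pass, since tensor parallelism needs all processes to take part; only the printing is restricted.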
Hi, I am getting two outputs too instead of one, have you sorted out this issue?