
[BUG] Multi-GPU training is much slower than single GPU (due to an additional process?)

Open RitaQian-westlake opened this issue 1 year ago • 0 comments

Describe the bug
I use DeepSpeed ZeRO Stage 2 Offload, integrated in pytorch-lightning, to train my model. On a single GPU, one epoch takes about 5 h. However, with two ranks the time extends to about 12.5 h.

Strangely, with a single GPU `nvidia-smi` shows only one process, while with 2-GPU training there are two processes on each rank. If I use a large batch size, training reports a CUDA out-of-memory error and exits. On a single GPU the training simply terminates. With two GPUs, however, only one of the processes on each rank exits; the remaining one keeps running, and after that the training continues normally at a much faster speed (2.5 h/epoch), which seems to be the expected speed (half of the single-GPU time).

It looks as if the additional process limits the speed of multi-GPU training. Is its existence normal? If so, what is its function? Is there any way to avoid it? (Obviously I don't want to kill it by triggering a CUDA out-of-memory error manually.)
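As a quick way to reproduce the observation above programmatically (rather than eyeballing `nvidia-smi`), one could count compute processes per GPU by parsing the CSV output of `nvidia-smi --query-compute-apps=gpu_uuid,pid --format=csv,noheader`. This is only an illustrative sketch, not part of the original report; the helper name `count_procs_per_gpu` is hypothetical:

```python
from collections import Counter

def count_procs_per_gpu(csv_text):
    """Count compute processes per GPU from nvidia-smi CSV output.

    Expects lines like "GPU-<uuid>, <pid>", as produced by:
        nvidia-smi --query-compute-apps=gpu_uuid,pid --format=csv,noheader
    """
    counts = Counter()
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        gpu_uuid = line.split(",")[0].strip()  # first CSV field is the GPU UUID
        counts[gpu_uuid] += 1
    return counts

# Canned output resembling the reported 2-GPU case:
sample = (
    "GPU-0aaa, 1111\n"
    "GPU-0aaa, 1112\n"   # second (unexpected) process on rank 0
    "GPU-1bbb, 2221\n"
    "GPU-1bbb, 2222\n"   # second (unexpected) process on rank 1
)
print(dict(count_procs_per_gpu(sample)))  # {'GPU-0aaa': 2, 'GPU-1bbb': 2}
```

On a live machine, feed it the actual `nvidia-smi` output (e.g. via `subprocess.run(..., capture_output=True, text=True).stdout`); in the reporter's single-GPU run this should show one process, and in the 2-GPU run two processes per GPU.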

ds_report output

[2024-04-22 20:46:26,737] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/qianrt/soft/anaconda/build/envs/ml/lib/python3.10/site-packages/torch']
torch version .................... 2.0.1
deepspeed install path ........... ['/home/qianrt/soft/anaconda/build/envs/ml/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.14.0, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
shared memory (/dev/shm) size .... 125.69 GB

Screenshots
2-GPU: (screenshot attached)
single-GPU: (screenshot attached)

System info (please complete the following information):

  • OS: CentOS Linux release 7.4.1708
  • GPU: ranks [0,1] in one A40
  • Interconnects (if applicable): "NODE" - connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  • Python version: 3.10.12

Launcher context pytorch-lightning
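For context, Lightning's built-in `"deepspeed_stage_2_offload"` strategy corresponds roughly to a DeepSpeed config like the following; this is a sketch of the standard ZeRO Stage 2 optimizer-offload options, and the exact config Lightning generates may differ:

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}
```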

RitaQian-westlake · Apr 22 '24 13:04