DeepSpeed Training stuck after loading the model?

Issue: Training doesn't begin after loading the model.

DS_REPORT

(base) ext_abdul.waheed@p4-r69-a:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
(base) ext_abdul.waheed@p4-r69-a:~$ which nvcc
/usr/local/cuda/bin/nvcc
(base) ext_abdul.waheed@p4-r69-a:~$ ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio isHere are the last few lines from the logs:
``` already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/nfs/users/ext_abdul.waheed/miniconda3/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/nfs/users/ext_abdul.waheed/miniconda3/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.8.0, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.0
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
(base) ext_abdul.waheed@p4-r69-a:~$

More Details:

HF Trainer, 
DeepSpeed ZeRO Stage 2
Single node - 8xv100 - 32GB GPUs
CPU Per Task - 32 
CPU Mem - 256GB

Here are the last few lines from the logs:

loading weights file pytorch_model.bin from cache at /nfs/users/ext_abdul.waheed/ASRs/experiments/cache/models/snapshots/bd0efe4d58db161e5ca3940e7c5940221e1b9646/pytorch_model.bin
All model checkpoint weights were used when initializing AutoModelForSpeechSeq2Seq.

loading model .......[changed]
Using cuda_amp half precision backend
wandb: Currently logged in as: macab. Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /nfs/users/ext_abdul.waheed/.netrc
wandb: Tracking run with wandb version 0.13.10

02/11/2023 03:22:43 - INFO - __main__ -   *** Training ***
[2023-02-11 03:22:43,083] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.8.0, git-hash=unknown, git-branch=unknown
[2023-02-11 03:23:37,830] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Installed CUDA version 11.0 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.0 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combinationInstalled CUDA version 11.0 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination

Installed CUDA version 11.0 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.0 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combinationInstalled CUDA version 11.0 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination

Installed CUDA version 11.0 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.0 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /nfs/users/ext_abdul.waheed/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Using /nfs/users/ext_abdul.waheed/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Using /nfs/users/ext_abdul.waheed/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Using /nfs/users/ext_abdul.waheed/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Using /nfs/users/ext_abdul.waheed/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Using /nfs/users/ext_abdul.waheed/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Using /nfs/users/ext_abdul.waheed/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Using /nfs/users/ext_abdul.waheed/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...

Nothing happens after this. GPUs memory utilization remains same.

nvidia-smi output


-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM3...  On   | 00000000:39:00.0 Off |                    0 |
| N/A   36C    P0    69W / 350W |   4308MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM3...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   39C    P0    68W / 350W |   4332MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM3...  On   | 00000000:57:00.0 Off |                    0 |
| N/A   30C    P0    66W / 350W |   4332MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM3...  On   | 00000000:59:00.0 Off |                    0 |
| N/A   38C    P0    71W / 350W |   4332MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM3...  On   | 00000000:5C:00.0 Off |                    0 |
| N/A   31C    P0    69W / 350W |   4332MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM3...  On   | 00000000:5E:00.0 Off |                    0 |
| N/A   39C    P0    70W / 350W |   4332MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM3...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   28C    P0    66W / 350W |   4332MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM3...  On   | 00000000:B9:00.0 Off |                    0 |
| N/A   28C    P0    67W / 350W |   4212MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     19047      C   ...eed/miniconda3/bin/python     4303MiB |
|    1   N/A  N/A     19048      C   ...eed/miniconda3/bin/python     4327MiB |
|    2   N/A  N/A     19049      C   ...eed/miniconda3/bin/python     4327MiB |
|    3   N/A  N/A     19050      C   ...eed/miniconda3/bin/python     4327MiB |
|    4   N/A  N/A     19051      C   ...eed/miniconda3/bin/python     4327MiB |
|    5   N/A  N/A     19052      C   ...eed/miniconda3/bin/python     4327MiB |
|    6   N/A  N/A     19053      C   ...eed/miniconda3/bin/python     4327MiB |
|    7   N/A  N/A     19054      C   ...eed/miniconda3/bin/python     4207MiB

CC: @HeyangQin @tjruwase

Feb 11 '23 00:02 macabdul9

@macabdul9 removing the cached pytorch extension works for me.

rm -rf /nfs/users/ext_abdul.waheed/.cache/torch_extensions/py310_cu117

Mar 01 '23 12:03 theblackcat102

the same issue occur in my program. And I also find the reason of stuck is the tensor can not be moved to GPU. the same error will happen when i use tensor.cuda(). I don't even know how to fix it.