DeepSpeed
Training stuck after loading the model?
Issue: Training doesn't begin after loading the model.
DS_REPORT
(base) ext_abdul.waheed@p4-r69-a:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
(base) ext_abdul.waheed@p4-r69-a:~$ which nvcc
/usr/local/cuda/bin/nvcc
(base) ext_abdul.waheed@p4-r69-a:~$ ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/nfs/users/ext_abdul.waheed/miniconda3/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/nfs/users/ext_abdul.waheed/miniconda3/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.8.0, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.0
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
(base) ext_abdul.waheed@p4-r69-a:~$
More Details:
HF Trainer
DeepSpeed ZeRO Stage 2
Single node - 8x V100 32GB GPUs
CPUs per task - 32
CPU memory - 256GB
Here are the last few lines from the logs:
loading weights file pytorch_model.bin from cache at /nfs/users/ext_abdul.waheed/ASRs/experiments/cache/models/snapshots/bd0efe4d58db161e5ca3940e7c5940221e1b9646/pytorch_model.bin
All model checkpoint weights were used when initializing AutoModelForSpeechSeq2Seq.
loading model .......[changed]
Using cuda_amp half precision backend
wandb: Currently logged in as: macab. Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /nfs/users/ext_abdul.waheed/.netrc
wandb: Tracking run with wandb version 0.13.10
02/11/2023 03:22:43 - INFO - __main__ - *** Training ***
[2023-02-11 03:22:43,083] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.8.0, git-hash=unknown, git-branch=unknown
[2023-02-11 03:23:37,830] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Installed CUDA version 11.0 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.0 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.0 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.0 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.0 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.0 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.0 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.0 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /nfs/users/ext_abdul.waheed/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Using /nfs/users/ext_abdul.waheed/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Using /nfs/users/ext_abdul.waheed/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Using /nfs/users/ext_abdul.waheed/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Using /nfs/users/ext_abdul.waheed/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Using /nfs/users/ext_abdul.waheed/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Using /nfs/users/ext_abdul.waheed/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Using /nfs/users/ext_abdul.waheed/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Nothing happens after this. GPU memory utilization remains the same.
nvidia-smi output:
+-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM3... On | 00000000:39:00.0 Off | 0 |
| N/A 36C P0 69W / 350W | 4308MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM3... On | 00000000:3B:00.0 Off | 0 |
| N/A 39C P0 68W / 350W | 4332MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM3... On | 00000000:57:00.0 Off | 0 |
| N/A 30C P0 66W / 350W | 4332MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM3... On | 00000000:59:00.0 Off | 0 |
| N/A 38C P0 71W / 350W | 4332MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM3... On | 00000000:5C:00.0 Off | 0 |
| N/A 31C P0 69W / 350W | 4332MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM3... On | 00000000:5E:00.0 Off | 0 |
| N/A 39C P0 70W / 350W | 4332MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM3... On | 00000000:B7:00.0 Off | 0 |
| N/A 28C P0 66W / 350W | 4332MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM3... On | 00000000:B9:00.0 Off | 0 |
| N/A 28C P0 67W / 350W | 4212MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 19047 C ...eed/miniconda3/bin/python 4303MiB |
| 1 N/A N/A 19048 C ...eed/miniconda3/bin/python 4327MiB |
| 2 N/A N/A 19049 C ...eed/miniconda3/bin/python 4327MiB |
| 3 N/A N/A 19050 C ...eed/miniconda3/bin/python 4327MiB |
| 4 N/A N/A 19051 C ...eed/miniconda3/bin/python 4327MiB |
| 5 N/A N/A 19052 C ...eed/miniconda3/bin/python 4327MiB |
| 6 N/A N/A 19053 C ...eed/miniconda3/bin/python 4327MiB |
| 7 N/A N/A 19054 C ...eed/miniconda3/bin/python 4207MiB |
+-----------------------------------------------------------------------------+
CC: @HeyangQin @tjruwase
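When every rank stops right after "Using ... as PyTorch extensions root", a likely culprit is a stale file lock in the JIT build cache from an earlier, interrupted compile. A minimal diagnostic sketch, assuming py-spy is installed and reusing the cache path and a PID from the output above:

```bash
# A hang at "Using ... as PyTorch extensions root" often means the JIT build
# is waiting on a leftover lock from a previous, interrupted compile
# (assumption here, consistent with the workaround in the next comment).

# 1) Look for leftover lock files in the extensions cache shown in the logs.
ls -R /nfs/users/ext_abdul.waheed/.cache/torch_extensions/py310_cu117 | grep -i lock

# 2) Attach py-spy (pip install py-spy) to one of the hung ranks from the
#    nvidia-smi output, e.g. PID 19047, to see where it is blocked.
py-spy dump --pid 19047
```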
@macabdul9 removing the cached pytorch extension works for me.
rm -rf /nfs/users/ext_abdul.waheed/.cache/torch_extensions/py310_cu117
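For reference, a more generic, hedged form of the same workaround; the default cache path and the per-job directory below are assumptions, and TORCH_EXTENSIONS_DIR is the PyTorch environment variable that controls where JIT-compiled extensions are built:

```bash
# Clear the default JIT build cache so DeepSpeed recompiles its ops cleanly
# (this is the usual PyTorch default; adjust if your cache lives elsewhere).
rm -rf ~/.cache/torch_extensions

# Alternatively, point future builds at a fresh per-job directory via
# TORCH_EXTENSIONS_DIR so a stale lock cannot be picked up again.
export TORCH_EXTENSIONS_DIR=/tmp/torch_extensions_$USER
```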
The same issue occurs in my program. I also found that the cause of the hang is that tensors cannot be moved to the GPU; the same error happens when I call tensor.cuda(). I don't know how to fix it.
@theblackcat102, @macabdul9 is this issue now resolved?
@newtonysls, it sounds like a different problem; can you please open a new ticket?
Thanks
Problem solved. My issue was caused by a wrong BIOS setting on the GPU machine.
@macabdul9 removing the cached pytorch extension works for me.
rm -rf /nfs/users/ext_abdul.waheed/.cache/torch_extensions/py310_cu117
But how can I do this inside a Docker container?
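One possible way, sketched under assumptions: the container name `trainer` and image name `my-training-image` are hypothetical, and the in-container cache path depends on which user runs the training process.

```bash
# Option 1: delete the cache inside the running container (container name
# "trainer" and the /root home directory are assumptions).
docker exec trainer rm -rf /root/.cache/torch_extensions

# Option 2: start the container with the cache redirected to an ephemeral
# path, so every run gets a clean JIT build directory.
docker run --rm -e TORCH_EXTENSIONS_DIR=/tmp/torch_extensions my-training-image
```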
same question.
thx!!!
saved my day!
Same question.
Hi, I hit the same issue. Removing .cache does not work. Have you solved the problem? @LinB203 @Yiqiu-Zhang