DDP training timeout
Bug description
I am using the default configs, code and data to train a model within the BioNeMo framework. The timeout occurs in the middle of training.
What version are you seeing the problem on?
v2.2
How to reproduce the bug
The config that might be relevant to the training (a rough plain-Lightning sketch of the same settings follows the block):
trainer:
devices: 8 # number of GPUs or CPUs
num_nodes: 1
accelerator: gpu #gpu or cpu
precision: 16 #16 or 32
logger: False # logger is provided by NeMo exp_manager
enable_checkpointing: False # checkpointing is done by NeMo exp_manager
replace_sampler_ddp: False # use NeMo Megatron samplers
max_epochs: null # use max_steps instead with NeMo Megatron model
log_every_n_steps: 10 # number of iterations between logging
val_check_interval: 15e4
limit_val_batches: 50 # number of batches in validation step, use fraction for fraction of data, 0 to disable
limit_test_batches: 500 # number of batches in test step, use fraction for fraction of data, 0 to disable
accumulate_grad_batches: 1
gradient_clip_val: 1.0
benchmark: False
max_steps: 500000
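For reference, here is roughly the same trainer section expressed as a plain pytorch-lightning call. This is illustrative only; NeMo builds the real Trainer from this YAML via Hydra and exp_manager, and the sketch assumes the 1.9-style argument names (e.g. replace_sampler_ddp, which became use_distributed_sampler in Lightning 2.x).

```python
# Illustrative sketch only: the trainer section above as a plain pytorch-lightning 1.9 call.
from pytorch_lightning import Trainer

trainer = Trainer(
    devices=8,
    num_nodes=1,
    accelerator="gpu",
    precision=16,
    logger=False,                 # logging handled by NeMo exp_manager
    enable_checkpointing=False,   # checkpointing handled by NeMo exp_manager
    replace_sampler_ddp=False,    # NeMo Megatron samplers
    max_epochs=None,              # run by max_steps instead
    log_every_n_steps=10,
    val_check_interval=150_000,   # 15e4
    limit_val_batches=50,
    limit_test_batches=500,
    accumulate_grad_batches=1,
    gradient_clip_val=1.0,
    benchmark=False,
    max_steps=500_000,
)
```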
### Error messages and logs
Epoch 0: 6%|██ | 32040/500150 [6:28:43<94:39:17, 1.37it/s, loss=2.6, v_num=95nc, reduced_train_loss=2.590, global_step=3.2e+4, consumed_samples=2.56e+7]
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624886 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:472] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:478] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:830] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:472] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:478] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:830] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:472] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:478] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:830] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800741 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800733 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800769 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800781 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800847 milliseconds before timing out.
### Environment
a03-zpeng@m3dgx01:~$ pip list Package Version Location
absl-py 1.4.0 accessible-pygments 0.0.4 aiohttp 3.9.0 aiosignal 1.3.1 alabaster 0.7.13 aniso8601 9.0.1 annotated-types 0.6.0 antlr4-python3-runtime 4.9.3 apex 0.1 appdirs 1.4.4 argon2-cffi 21.3.0 argon2-cffi-bindings 21.2.0 asttokens 2.2.1 astunparse 1.6.3 async-timeout 4.0.2 attrdict 2.0.1 attrs 23.1.0 audioread 3.0.0 awscli 1.29.67 Babel 2.12.1 backcall 0.2.0 beautifulsoup4 4.12.2 bionemo 0.2.0.dev0 /workspace/bionemo biopandas 0.4.1 biopython 1.79 black 23.1.0 bleach 6.0.0 blinker 1.6.2 blis 0.7.9 boto3 1.28.10 botocore 1.31.67 braceexpand 0.1.7 Brotli 1.1.0 cachetools 5.3.1 catalogue 2.0.8 cdifflib 1.2.6 certifi 2023.7.22 cffi 1.15.1 cfgv 3.4.0 charset-normalizer 3.1.0 click 8.1.7 cloudpickle 2.2.1 cmake 3.24.1.1 colorama 0.4.4 coloredlogs 15.0.1 comm 0.1.3 commonmark 0.9.1 confection 0.0.4 contourpy 1.0.7 coverage 7.4.0 crc32c 2.3.post0 cubinlinker 0.3.0+2.g87b01ae cuda-python 12.1.0rc5+1.g38940ef cudf 23.4.0 cugraph 23.4.0 cugraph-dgl 23.4.0 cugraph-service-client 23.4.0 cugraph-service-server 23.4.0 cuml 23.4.0 cupy-cuda12x 12.0.0b3 cycler 0.11.0 cymem 2.0.7 Cython 0.29.35 dacite 1.8.1 dask 2023.3.2 dask-cuda 23.4.0 dask-cudf 23.4.0 debugpy 1.6.7 decorator 5.1.1 defusedxml 0.7.1 dgl 1.1.3 dgllife 0.2.8 diffdock 0.0.5 dill 0.3.7 Distance 0.1.3 distlib 0.3.8 distributed 2023.3.2.1 DLLogger 1.0.0 docker-pycreds 0.4.0 docopt 0.6.2 docutils 0.16 e3nn 0.5.1 editdistance 0.6.2 einops 0.6.1 exceptiongroup 1.1.1 execnet 1.9.0 executing 1.2.0 expecttest 0.1.3 fair-esm 2.0.0 faiss-cpu 1.7.4 fastjsonschema 2.17.1 fastrlock 0.8.1 fasttext 0.9.2 filelock 3.12.2 fire 0.5.0 flash-attn 1.0.7 Flask 2.2.5 Flask-RESTful 0.3.10 flatbuffers 23.5.26 fonttools 4.47.2 frozenlist 1.3.3 fsspec 2023.5.0 ftfy 6.1.1 future 0.18.3 g2p-en 2.1.0 gast 0.4.0 gdown 4.7.1 gevent 23.9.1 geventhttpclient 2.0.2 gitdb 4.0.10 GitPython 3.1.41 google-auth 2.20.0 google-auth-oauthlib 0.4.6 graphsurgeon 0.4.6 graphviz 0.20.1 greenlet 3.0.3 grpcio 1.56.0 h5py 3.9.0 huggingface-hub 0.20.2 humanfriendly 10.0 hydra-core 1.2.0 hyperopt 0.2.7 hypothesis 5.35.1 identify 2.5.33 idna 3.4 ijson 3.2.3 imagesize 1.4.1 importlib-metadata 6.6.0 inflect 7.0.0 iniconfig 2.0.0 intel-openmp 2021.4.0 ipadic 1.0.0 ipdb 0.13.11 ipykernel 6.23.3 ipython 8.14.0 ipython-genutils 0.2.0 ipywidgets 8.0.7 isort 5.12.0 itsdangerous 2.1.2 jedi 0.18.2 jieba 0.42.1 Jinja2 3.1.2 jiwer 2.5.2 jmespath 1.0.1 joblib 1.2.0 json5 0.9.14 jsonlines 4.0.0 jsonschema 4.17.3 jupyter_client 8.3.0 jupyter_core 5.3.1 jupyter-tensorboard 0.2.0 jupyterlab 2.3.2 jupyterlab-pygments 0.2.2 jupyterlab-server 1.2.0 jupyterlab-widgets 3.0.8 jupytext 1.14.6 k2 1.24.3.dev20230725+cuda12.1.torch2.1.0a0 kaldi-python-io 1.2.2 kaldiio 2.18.0 kiwisolver 1.4.4 kornia 0.6.12 langcodes 3.3.0 latexcodec 2.0.1 Levenshtein 0.21.1 librosa 0.9.2 lightning-utilities 0.9.0 llvmlite 0.39.1 locket 1.0.0 loguru 0.7.0 lxml 4.9.3 Markdown 3.4.3 markdown-it-py 2.2.0 markdown2 2.4.9 MarkupSafe 2.1.3 marshmallow 3.20.1 matplotlib 3.4.3 matplotlib-inline 0.1.6 mdit-py-plugins 0.4.0 mdurl 0.1.2 mecab-python3 1.0.5 megatron-core 0.2.0 mistune 3.0.1 mkl 2021.1.1 mkl-devel 2021.1.1 mkl-include 2021.1.1 mock 5.0.2 more-itertools 10.1.0 mpmath 0.19 msgpack 1.0.5 multidict 6.0.4 murmurhash 1.0.9 mypy-extensions 1.0.0 nbclient 0.8.0 nbconvert 7.6.0 nbformat 5.9.0 nemo-text-processing 0.1.8rc0 nemo-toolkit 1.20.0 nest-asyncio 1.5.6 networkx 2.6.3 ninja 1.11.1 nltk 3.8.1 nodeenv 1.8.0 notebook 6.4.10 numba 0.56.4+1.g5f1bc7084 numpy 1.22.2 nvidia-dali-cuda120 1.26.0 nvidia-pyindex 1.0.9 nvidia-pytriton 0.4.0 nvtx 
0.2.5 oauthlib 3.2.2 omegaconf 2.2.3 onnx 1.14.1 onnx-graphsurgeon 0.3.27 onnxruntime-gpu 1.16.3 onnxscript 0.1.0.dev20240113 OpenCC 1.1.6 opencv 4.6.0 opt-einsum 3.3.0 opt-einsum-fx 0.1.4 packaging 23.1 pandas 1.5.2 pandocfilters 1.5.0 pangu 4.0.6.1 parameterized 0.9.0 parso 0.8.3 partd 1.4.0 pathspec 0.11.1 pathtools 0.1.2 pathy 0.10.2 pexpect 4.8.0 pickleshare 0.7.5 Pillow 10.0.1 pip 21.2.4 pipdeptree 2.13.0 plac 1.3.5 platformdirs 4.1.0 pluggy 1.2.0 ply 3.11 polars 0.16.7 polygraphy 0.47.1 pooch 1.7.0 portalocker 2.7.0 POT 0.7.0 pre-commit 3.4.0 preshed 3.0.8 prettytable 3.8.0 progress 1.6 prometheus-client 0.17.0 prompt-toolkit 3.0.38 protobuf 3.20.3 psutil 5.9.4 ptxcompiler 0.8.1+1.gbe9fca5 ptyprocess 0.7.0 pure-eval 0.2.2 py 1.11.0 py-cpuinfo 9.0.0 py4j 0.10.9.7 pyannote.core 5.0.0 pyannote.database 5.0.1 pyannote.metrics 3.2.1 pyarrow 14.0.1 pyasn1 0.5.0 pyasn1-modules 0.3.0 pybind11 2.10.4 pybtex 0.24.0 pybtex-docutils 1.0.2 pycocotools 2.0+nv0.7.3 pycparser 2.21 pydantic 2.5.3 pydantic_core 2.14.6 pydata-sphinx-theme 0.13.1 pydub 0.25.1 pyfaidx 0.7.2 pyfastx 1.1.0 Pygments 2.15.1 pylibcugraph 23.4.0 pylibcugraphops 23.4.0 pylibraft 23.4.0 Pympler 1.0.1 pynini 2.1.5 pynvml 11.4.1 pyparsing 3.0.9 pypinyin 0.49.0 pypinyin-dict 0.6.0 pyrsistent 0.19.3 PySocks 1.7.1 pytest 7.4.0 pytest-cov 4.1.0 pytest-dependency 0.5.1 pytest-forked 1.6.0 pytest-rerunfailures 11.1.2 pytest-runner 6.0.0 pytest-shard 0.1.2 pytest-timeout 2.2.0 pytest-xdist 3.3.1 python-dateutil 2.8.2 python-hostlist 1.23.0 python-rapidjson 1.14 python-slugify 8.0.1 pytorch-lightning 1.9.4 pytorch-quantization 2.1.2 pytz 2023.3 PyYAML 6.0 pyzmq 23.2.1 raft-dask 23.4.0 rapidfuzz 2.13.7 rdkit 2023.9.1 rdkit-pypi 2022.9.5 regex 2023.6.3 requests 2.31.0 requests-mock 1.11.0 requests-oauthlib 1.3.1 resampy 0.4.2 rich 12.6.0 rmm 23.4.0 rouge-score 0.1.2 rsa 4.7.2 ruamel.yaml 0.17.32 ruamel.yaml.clib 0.2.7 ruff 0.0.292 s3transfer 0.7.0 sacrebleu 2.3.1 sacremoses 0.0.53 safetensors 0.3.1 scikit-learn 1.2.0 scipy 1.10.1 seaborn 0.12.2 Send2Trash 1.8.2 sentence-transformers 2.2.2 sentencepiece 0.1.99 sentry-sdk 1.28.1 setproctitle 1.3.2 setuptools 65.5.1 sh 1.14.3 shellingham 1.5.0.post1 six 1.16.0 smart-open 6.3.0 smmap 5.0.0 snowballstemmer 2.2.0 sortedcontainers 2.4.0 soundfile 0.12.1 soupsieve 2.4.1 sox 1.4.1 spacy 3.5.3 spacy-legacy 3.0.12 spacy-loggers 1.0.4 Sphinx 5.3.0 sphinx-book-theme 1.0.0 sphinx-copybutton 0.5.2 sphinx-glpi-theme 0.3 sphinxcontrib-applehelp 1.0.4 sphinxcontrib-bibtex 2.5.0 sphinxcontrib-devhelp 1.0.2 sphinxcontrib-htmlhelp 2.0.1 sphinxcontrib-jsmath 1.0.1 sphinxcontrib-qthelp 1.0.3 sphinxcontrib-serializinghtml 1.1.5 sphinxext-opengraph 0.8.2 spyrmsd 0.5.2 srsly 2.4.6 stack-data 0.6.2 sympy 1.12 tabulate 0.9.0 tbb 2021.9.0 tblib 1.7.0 tensorboard 2.9.0 tensorboard-data-server 0.6.1 tensorboard-plugin-wit 1.8.1 tensorrt 8.6.1 termcolor 2.3.0 terminado 0.17.1 testbook 0.4.2 text-unidecode 1.3 textdistance 4.5.0 texterrors 0.4.4 tfrecord 1.14.1 thinc 8.1.10 threadpoolctl 3.1.0 thriftpy2 0.4.16 tinycss2 1.2.1 tokenizers 0.15.0 toml 0.10.2 tomli 2.0.1 toolz 0.12.0 torch 2.1.0a0+4136153 torch-cluster 1.6.1 torch-geometric 2.3.0 torch-scatter 2.0.9 torch-sparse 0.6.17 torch-tensorrt 1.5.0.dev0 torchaudio 2.1.0 torchdata 0.7.0a0 torchmetrics 1.0.1 torchvision 0.16.0a0 tornado 6.3.2 tqdm 4.65.0 traitlets 5.9.0 transformer-engine 0.9.0 transformers 4.36.0 treelite 3.2.0 treelite-runtime 3.2.0 triton 2.0.0.dev20221202 triton-model-navigator 0.7.4 tritonclient 2.41.1 typed-ast 1.5.5 typer 0.7.0 types-dataclasses 
0.6.6 typing_extensions 4.6.3 typing-inspect 0.6.0 ucx-py 0.31.0 uff 0.6.9 urllib3 1.26.16 virtualenv 20.25.0 wandb 0.15.6 wasabi 1.1.2 wcwidth 0.2.6 webdataset 0.2.33 webencodings 0.5.1 Werkzeug 2.3.6 wget 3.2 wheel 0.40.0 widgetsnbextension 4.0.8 wrapt 1.14.1 xdoctest 1.0.2 xgboost 1.7.5 yarl 1.9.2 youtokentome 1.0.6 zict 3.0.0 zipp 3.15.0 zope.event 5.0 zope.interface 6.1
### More info
I am using the NVIDIA BioNeMo framework.
@pengzhangzhi Can you describe the steps to reproduce this? There are several notebooks in the examples folder https://github.com/NVIDIA/BioNeMo/tree/main/examples/service/notebooks but I doubt you are running these. Where is the code and config that you are running?
Hi @awaelchli, the GitHub repo does not contain the whole training code; I got it from their Docker containers. If you want to reproduce it, here is the doc: https://docs.nvidia.com/bionemo-framework/latest/quickstart-fw.html. I am afraid it is too much work for you to reproduce, since you have to download and prepare all the data. My problem is tricky in that the NCCL timeout happens during training, sometimes earlier, sometimes later, seemingly out of nowhere. I would like to know how to track down the error, because the log gives me no clue. I have tried many solutions with no luck, such as:
- increasing the timeout to a year (see the sketch after this list)
- setting a bunch of NCCL environment variables:
export NCCL_DEBUG=INFO
export NCCL_P2P_DISABLE=1
export NCCL_P2P_LEVEL=NVL
export NCCL_IB_GID_INDEX=3
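As a concrete reference for the first item, here is a minimal sketch of what "increasing the timeout" can look like in plain Lightning. This is my own illustration, assuming a Lightning 2.x release where DDPStrategy exposes a timeout argument; NeMo constructs the Trainer from its own config, so treat it only as an example.

```python
from datetime import timedelta

from lightning.pytorch import Trainer
from lightning.pytorch.strategies import DDPStrategy

# Sketch only: raise the NCCL collective timeout far above the 30-minute default
# (Timeout(ms)=1800000 in the watchdog errors above). This makes hangs easier to
# inspect but does not fix their root cause.
strategy = DDPStrategy(timeout=timedelta(days=365))
trainer = Trainer(devices=8, accelerator="gpu", strategy=strategy)
```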
If you insist on reproducing it, I am happy to help and provide a detailed guide. For simplicity, though, it would be great if you could guide me on how to debug this error. Thanks!!
Running into the same problem. I think it is hardware-independent. The code here uses the pytorch-lightning and NeMo frameworks. It happens after 8 hours of training.
Hey @pengzhangzhi, sorry, there's a lot going on recently and I'm trying to balance priorities. Sorry for the missed or delayed replies.
I implemented a system check utility to help with such problems. Feel free to test it out if you have the time: https://github.com/Lightning-AI/pytorch-lightning/pull/19609. The idea of this system check is that it is implemented in raw PyTorch, so if issues arise we would know whether they are in Lightning or not.
Thanks! Exactly what I need for debugging! Do you have a document for how to use this tool? The link in that PR isn't valid :( 📚 Documentation preview 📚: pytorch-lightning--19609.org.readthedocs.build/en/19609
Yes, the docs will only be generated once the PR is ready.
The easiest way for you to try it right now is to just copy this file
https://github.com/Lightning-AI/pytorch-lightning/blob/feature/system-check/src/lightning/fabric/utilities/system_check.py
locally and run it as python system_check.py. It doesn't require any dependencies other than torch and psutil.
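If you just want the gist of what the script exercises, here is a stripped-down sketch of the same idea in raw PyTorch. This is my own minimal version, not the actual system_check.py from the PR.

```python
# Minimal sketch: spawn one process per GPU, build an NCCL process group,
# and verify that a tiny all-reduce and a barrier complete.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    torch.cuda.set_device(rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    # Every rank contributes 1.0; after the default SUM all-reduce each rank must hold world_size.
    tensor = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(tensor)
    assert tensor.item() == world_size, f"rank {rank}: got {tensor.item()}"
    dist.barrier()
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(_worker, args=(world_size,), nprocs=world_size)
```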
Thanks!! Here is the log... I can't make sense of it. FYI, the code that I am having the problem with has been deployed on two systems, and both of them show the same timeout problem. Additionally, I have another pytorch-lightning / torch DDP project that works well on the current system without any timeout error.
Below is the output of `nvidia-smi`. It shows information about the GPUs that are installed on this machine, the driver version, and the maximum supported CUDA version it can run.
Thu Mar 14 17:08:07 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB Off | 00000000:07:00.0 Off | 0 |
| N/A 24C P0 59W / 400W | 7MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB Off | 00000000:0F:00.0 Off | 0 |
| N/A 22C P0 55W / 400W | 7MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM4-80GB Off | 00000000:47:00.0 Off | 0 |
| N/A 22C P0 58W / 400W | 7MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB Off | 00000000:4E:00.0 Off | 0 |
| N/A 22C P0 58W / 400W | 7MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM4-80GB Off | 00000000:87:00.0 Off | 0 |
| N/A 29C P0 59W / 400W | 7MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM4-80GB Off | 00000000:90:00.0 Off | 0 |
| N/A 27C P0 59W / 400W | 7MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM4-80GB Off | 00000000:B7:00.0 Off | 0 |
| N/A 51C P0 259W / 400W | 80363MiB / 81920MiB | 90% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM4-80GB Off | 00000000:BD:00.0 Off | 0 |
| N/A 50C P0 213W / 400W | 76741MiB / 81920MiB | 71% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
The matrix below shows how the GPUs in this machine are connected. NVLink (NV) is the fastest connection, and is only available on high-end systems like V100, A100, etc.
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 NIC10 NIC11 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS 48-63,176-191 3 N/A
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS 48-63,176-191 3 N/A
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 16-31,144-159 1 N/A
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 16-31,144-159 1 N/A
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127,240-255 7 N/A
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127,240-255 7 N/A
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95,208-223 5 N/A
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95,208-223 5 N/A
NIC0 PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC1 PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC2 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS SYS SYS SYS SYS
NIC3 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS SYS SYS SYS SYS
NIC4 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS X PIX SYS SYS SYS SYS SYS SYS
NIC5 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX X SYS SYS SYS SYS SYS SYS
NIC6 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS
NIC7 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS
NIC8 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS X PXB SYS SYS
NIC9 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS PXB X SYS SYS
NIC10 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS X PIX
NIC11 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
NIC9: mlx5_9
NIC10: mlx5_10
NIC11: mlx5_11
NCCL version 2.18.1+cuda12.1
Traceback (most recent call last):
File "/workspace/bionemo/debug.py", line 179, in <module>
main()
File "/workspace/bionemo/debug.py", line 48, in main
success = _check_cuda_distributed(timeout)
File "/workspace/bionemo/debug.py", line 84, in _check_cuda_distributed
success = context.join(timeout=5)
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 6 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/workspace/bionemo/debug.py", line 116, in _run_all_reduce_test
torch.distributed.barrier()
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 145, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3553, in barrier
work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1164, internal error - please report this issue to the NCCL developers, NCCL version 2.18.1
ncclInternalError: Internal check failed.
Last error:
Socket recv failed while polling for opId=0x7f8be9d30b00
This output shows that distributed PyTorch won't work on your system. It can't synchronize at the barrier, which is a very basic requirement.
There should be a folder named system_check; it might contain additional logs from NCCL with warnings. On rare occasions a driver update or downgrade can help, or reinstalling PyTorch in a fresh environment.
Thanks!!
Since the error is in process 6, I am showing the log of nccl-rank-6 below:
pbg-dgx-1:1243335:1243335 [6] NCCL INFO cudaDriverVersion 12020
pbg-dgx-1:1243335:1243335 [6] NCCL INFO Bootstrap : Using enp226s0:10.148.54.242<0>
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
pbg-dgx-1:1243335:1244249 [6] NCCL INFO P2P plugin IBext
pbg-dgx-1:1243335:1244249 [6] NCCL INFO NET/IB : No device found.
pbg-dgx-1:1243335:1244249 [6] NCCL INFO NET/IB : No device found.
pbg-dgx-1:1243335:1244249 [6] NCCL INFO NET/Socket : Using [0]enp226s0:10.148.54.242<0> [1]vethe42dbfb:fe80::a0ec:26ff:fe84:f001%vethe42dbfb<0> [2]vethd8013ee:fe80::f8e4:5dff:febd:1409%vethd8013ee<0> [3]vethfa79788:fe80::6813:94ff:fe9a:17e4%vethfa79788<0>
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Using network Socket
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Setting affinity for GPU 6 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000
pbg-dgx-1:1243335:1244249 [6] NCCL INFO NVLS multicast support is not available on dev 6
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
pbg-dgx-1:1243335:1244249 [6] NCCL INFO P2P Chunksize set to 524288
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Channel 00/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Channel 01/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Channel 02/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Channel 03/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Channel 04/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Channel 05/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Channel 06/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
pbg-dgx-1:1243335:1244302 [6] include/alloc.h:178 NCCL WARN Cuda failure 'out of memory'
pbg-dgx-1:1243335:1244302 [6] include/alloc.h:185 NCCL WARN Failed to CUDA calloc 6291456 bytes
pbg-dgx-1:1243335:1244302 [6] NCCL INFO transport/p2p.cc:204 -> 1
pbg-dgx-1:1243335:1244302 [6] NCCL INFO transport/p2p.cc:584 -> 1
pbg-dgx-1:1243335:1244302 [6] NCCL INFO proxy.cc:1303 -> 1
pbg-dgx-1:1243335:1244302 [6] NCCL INFO proxy.cc:1377 -> 1
pbg-dgx-1:1243335:1244302 [6] proxy.cc:1518 NCCL WARN [Proxy Service 6] Failed to execute operation Setup from rank 6, retcode 1
pbg-dgx-1:1243335:1244249 [6] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer pbg-dgx-1.egr.duke.edu<50889>
pbg-dgx-1:1243335:1244249 [6] NCCL INFO misc/socket.cc:746 -> 6
pbg-dgx-1:1243335:1244249 [6] proxy.cc:1143 NCCL WARN Socket recv failed while polling for opId=0x7f8be9d30b00
pbg-dgx-1:1243335:1244249 [6] NCCL INFO transport/p2p.cc:386 -> 3
pbg-dgx-1:1243335:1244249 [6] NCCL INFO transport.cc:33 -> 3
pbg-dgx-1:1243335:1244249 [6] NCCL INFO transport.cc:106 -> 3
pbg-dgx-1:1243335:1244249 [6] NCCL INFO init.cc:1032 -> 3
pbg-dgx-1:1243335:1244249 [6] NCCL INFO init.cc:1309 -> 3
pbg-dgx-1:1243335:1244249 [6] NCCL INFO group.cc:64 -> 3 [Async thread]
pbg-dgx-1:1243335:1243335 [6] NCCL INFO group.cc:422 -> 3
pbg-dgx-1:1243335:1243335 [6] NCCL INFO group.cc:106 -> 3
pbg-dgx-1:1243335:1243335 [6] NCCL INFO comm 0x55a0a637e960 rank 6 nranks 8 cudaDev 6 busId b7000 - Abort COMPLETE
FYI, I am using a Docker container. My setup can be reproduced with the following steps.
docker login nvcr.io
Username: $oauthtoken
Password NGc3bWIxM21mbTI0dTBraHE5N2U0NG1saWg6ZTY4MzlhZmUtYTJlZC00NDVmLThjYmEtNjA2ZTMzMzRkZTYy
Pull the Bionemo container:
docker pull nvcr.io/nvidia/clara/bionemo-framework:1.2
Run the container:
CONTAINER="nvcr.io/nvidia/clara/bionemo-framework:1.2"
DEST_PATH="."
CONTAINER_NAME=bionemo
docker run --name $CONTAINER_NAME -itd --rm $CONTAINER bash
To reproduce my error, you can copy the file feature/system-check/src/lightning/fabric/utilities/system_check.py into the container and run it.
I won't have the bandwidth to help much here. Maybe try disabling plugins: NCCL_NET_PLUGIN=none. And if you are running inside Docker, please make sure that NCCL is picking the correct network interface. Run the system check outside the container in a clean environment to see whether the problem is related to the container or not.
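To make those two suggestions concrete, here is a small sketch of setting them from the launching process before torch.distributed is initialized. The interface name is taken from the rank-6 log above; treat the values as examples from this thread, not a general recommendation.

```python
import os

# Must be set before the NCCL process group is created; processes spawned
# afterwards inherit these variables.
os.environ["NCCL_NET_PLUGIN"] = "none"         # disable the net plugin, as suggested above
os.environ["NCCL_SOCKET_IFNAME"] = "enp226s0"  # pin NCCL to the interface seen in the rank-6 log
os.environ["NCCL_DEBUG"] = "INFO"              # keep verbose logging while debugging
```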
I think the problem I have on nccl-rank-6 is just OOM based on the log?
pbg-dgx-1:1243335:1244302 [6] include/alloc.h:178 NCCL WARN Cuda failure 'out of memory'
If you ran my system check, that's not possible. It allocates very little memory on the GPU: https://github.com/Lightning-AI/pytorch-lightning/blob/297e9809d2da7ec2abb0f2e7c5e6c371ae0eaac8/src/lightning/fabric/utilities/system_check.py#L118
If what you show me there is the output of another program, then yes it looks like one rank runs out of memory. If one rank dies, the others will wait and hang forever.
Yeah. I think it is because some of the GPUs are already heavily utilized, which triggers the OOM problem shown in the log. I only ran your program in the container and on the host, and both logs show OOM on the two utilized GPUs.
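For what it's worth, a quick pre-flight check along these lines could catch the busy GPUs before launching. This is my own sketch and not part of the system check script; it only assumes torch is installed on the machine.

```python
# Verify that every GPU you plan to train on actually has its memory free, so a
# busy GPU does not fail NCCL setup with a CUDA OOM while the other ranks hang
# at the next collective until the watchdog timeout fires.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # returns (free_bytes, total_bytes)
    print(f"cuda:{i}: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")
    if free < 0.9 * total:
        print(f"  cuda:{i} looks busy; consider excluding it via CUDA_VISIBLE_DEVICES")
```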