Error Encountered During Multi-Node Pretraining with Torchrun
🐛 Describe the bug
Description:
We are pretraining on our own data with the following torchrun command:
torchrun --nnodes=$NODES --nproc_per_node=$GPUS --node_rank=$RANK --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT /scripts/train.py /configs/official/Olmo-7B.yaml
The pretraining works as expected on a single node with multiple GPUs. However, when scaling to multiple nodes, we encounter the following error:
Traceback (most recent call last):
  File "/scratch/amlt_code/OLmo-GFM/scripts/train.py", line 263, in <module>
    main(cfg)
  File "/scratch/amlt_code/OLmo-GFM/scripts/train.py", line 109, in main
    train_loader = build_train_dataloader(cfg)
  File "/scratch/amlt_code/OLmo-GFM/olmo/data/__init__.py", line 99, in build_train_dataloader
    IterableDataset(
  File "/scratch/amlt_code/OLmo-GFM/olmo/data/iterable_dataset.py", line 70, in __init__
    self._build_and_save_global_indices()
  File "/scratch/amlt_code/OLmo-GFM/olmo/data/iterable_dataset.py", line 79, in _build_and_save_global_indices
    global_indices_mmap = np.memmap(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/numpy/core/memmap.py", line 230, in __new__
    with f_ctx as fid:
OSError: [Errno 5] Input/output error
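To check whether the failure is reproducible outside the training script, the np.memmap call at the bottom of the traceback can be exercised on its own. The following is only a minimal sketch: `probe_memmap` is a hypothetical helper, not part of OLMo, and the dtype and size are arbitrary choices rather than the values the dataloader uses.

```python
import os
import sys

import numpy as np


def probe_memmap(directory: str, n: int = 1024) -> bool:
    """Return True if a writable np.memmap can be created, flushed,
    and read back in `directory`. Hypothetical diagnostic helper."""
    path = os.path.join(directory, f"memmap_probe_{os.getpid()}.npy")
    expected = np.arange(n, dtype=np.uint32)
    try:
        # Same kind of call as in the traceback: create a new file-backed array.
        mm = np.memmap(path, dtype=np.uint32, mode="w+", shape=(n,))
        mm[:] = expected
        mm.flush()
        del mm  # release the mapping before reopening
        back = np.memmap(path, dtype=np.uint32, mode="r", shape=(n,))
        ok = bool((back == expected).all())
        del back
        return ok
    except OSError as err:
        print(f"memmap probe failed in {directory}: {err!r}", file=sys.stderr)
        return False
    finally:
        # Clean up the probe file if it was created.
        if os.path.exists(path):
            os.remove(path)
```

Running `probe_memmap` once against a node-local directory and once against the mounted save folder, on each node, would show whether the mount itself rejects the call or whether it only fails when several nodes are active at the same time.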
Additional Details:
- The error seems unrelated to GPU settings or data preparation: the traceback ends in np.memmap inside _build_and_save_global_indices, while the train dataloader is being built.
- We suspect the storage backend, because the checkpoint directory is on a mounted Azure Blob Storage container.
- The error occurs only during multi-node execution, which suggests an I/O problem specific to the distributed setup.
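One explanation consistent with these observations, though not confirmed here, is that more than one process tries to create the same file on the shared mount. The sketch below shows how a single writer could be chosen from the `RANK` and `LOCAL_RANK` environment variables that torchrun sets for each worker. `should_write_shared_file` is a hypothetical helper, and whether OLMo already gates the write this way is not something this sketch assumes.

```python
import os
from typing import Mapping


def should_write_shared_file(
    shared_across_nodes: bool, env: Mapping[str, str] = os.environ
) -> bool:
    """Decide whether this process should create a file that all ranks read.

    Hypothetical helper. torchrun sets RANK (global) and LOCAL_RANK
    (per node) in each worker's environment.
    """
    rank = int(env.get("RANK", "0"))
    local_rank = int(env.get("LOCAL_RANK", "0"))
    if shared_across_nodes:
        # Shared mount: exactly one writer in the whole job.
        return rank == 0
    # Node-local disk: one writer per node.
    return local_rank == 0
```

If the writer is chosen by local rank while the target directory is shared across nodes, every node's first worker would write the same path at once, which would only show up in multi-node runs.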
Request:
Has anyone on the team encountered a similar issue during development? Any insights or troubleshooting suggestions would be greatly appreciated, in particular on whether the storage system is compatible with multi-node distributed training.
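One workaround we are considering, in case the mount turns out not to handle memory-mapped writes well, is to build the index file on node-local disk and then copy it to the save folder with an ordinary sequential write. This is a sketch under that assumption only: `build_indices_locally` is a hypothetical function, and the real index construction in iterable_dataset.py is not reproduced here.

```python
import os
import shutil
import tempfile

import numpy as np


def build_indices_locally(indices: np.ndarray, final_path: str) -> str:
    """Write `indices` through a memmap on local disk, then copy the
    resulting raw file to `final_path`. Hypothetical workaround sketch."""
    with tempfile.TemporaryDirectory() as scratch:
        local_path = os.path.join(scratch, os.path.basename(final_path))
        # The memory-mapped write happens on local disk only.
        mm = np.memmap(local_path, dtype=indices.dtype, mode="w+", shape=indices.shape)
        mm[:] = indices
        mm.flush()
        del mm  # release the mapping before copying
        os.makedirs(os.path.dirname(final_path) or ".", exist_ok=True)
        # The target only ever sees a plain sequential file write.
        shutil.copyfile(local_path, final_path)
    return final_path
```

Readers on other ranks would still open the copied file afterwards, so this only helps if the problem is in creating or writing the mapping on the mount, not in reading from it.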
Versions
accelerate==0.34.2
-e git+https://github.com/Zehui127/OLmo-GFM.git@cd9edbb980a245aab29210d32a66e0e8b33ee4a5#egg=ai2_olmo
aiohappyeyeballs==2.4.0
aiohttp==3.10.5
aiosignal==1.3.1
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
asttokens==2.4.1
async-timeout==4.0.3
attrs==24.2.0
backports.tarfile==1.2.0
beaker-gantry==1.8.3
beaker-py==1.31.2
beautifulsoup4==4.12.3
biopython==1.84
biotite==0.41.2
black==23.12.1
boltons==24.0.0
boto3==1.35.5
botocore==1.35.5
Brotli==1.1.0
build==1.2.1
cached_path==1.6.3
cachetools==5.5.0
certifi==2024.7.4
cffi==1.17.0
charset-normalizer==3.3.2
click==8.1.7
click-help-colors==0.9.4
cloudpathlib==0.18.1
contourpy==1.2.1
cryptography==43.0.0
cycler==0.12.1
datasets==2.21.0
decorator==5.1.1
dill==0.3.8
docker==7.1.0
docker-pycreds==0.4.0
docutils==0.21.2
einops==0.8.0
esm==3.0.2
evaluate==0.4.3
exceptiongroup==1.2.2
executing==2.0.1
face==20.1.1
filelock==3.13.4
fonttools==4.53.1
frozenlist==1.4.1
fsspec==2024.6.1
ftfy==6.2.3
gdown==5.2.0
gitdb==4.0.11
GitPython==3.1.43
glom==23.5.0
google-api-core==2.19.1
google-auth==2.34.0
google-cloud-core==2.4.1
google-cloud-storage==2.18.2
google-crc32c==1.5.0
google-resumable-media==2.7.2
googleapis-common-protos==1.63.2
huggingface-hub==0.23.5
idna==3.7
importlib_metadata==8.4.0
iniconfig==2.0.0
ipython==8.26.0
isort==5.12.0
jaraco.classes==3.4.0
jaraco.context==6.0.1
jaraco.functools==4.0.2
jedi==0.19.1
jeepney==0.8.0
Jinja2==3.1.4
jmespath==1.0.1
joblib==1.4.2
keyring==25.3.0
kiwisolver==1.4.5
lightning-utilities==0.11.6
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.2
matplotlib-inline==0.1.7
mdurl==0.1.2
more-itertools==10.4.0
mpmath==1.3.0
msgpack==1.0.8
msgpack-numpy==0.4.8
msgspec==0.18.6
multidict==6.0.5
multiprocess==0.70.16
mypy==1.3.0
mypy-extensions==1.0.0
necessary==0.4.3
networkx==3.3
nh3==0.2.18
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.20
nvidia-nvtx-cu12==12.1.105
omegaconf==2.3.0
packaging==24.1
pandas==2.2.2
parso==0.8.4
pathspec==0.12.1
peft==0.12.0
petname==2.6
pexpect==4.9.0
pillow==10.4.0
pkginfo==1.10.0
platformdirs==4.2.2
pluggy==1.5.0
prompt_toolkit==3.0.47
proto-plus==1.24.0
protobuf==5.27.3
psutil==6.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
pyarrow==17.0.0
pyasn1==0.6.0
pyasn1_modules==0.4.0
pycparser==2.22
pydantic==2.8.2
pydantic_core==2.20.1
Pygments==2.18.0
pyparsing==3.1.2
pyproject_hooks==1.1.0
PySocks==1.7.1
pytest==8.3.2
pytest-sphinx==0.6.3
python-dateutil==2.9.0.post0
pytorch-lightning==2.4.0
pytz==2024.1
PyYAML==6.0.2
readme_renderer==44.0
regex==2024.7.24
requests==2.32.3
requests-toolbelt==1.0.0
requirements-parser==0.11.0
rfc3986==2.0.0
rich==13.7.1
rsa==4.9
ruff==0.6.2
s3transfer==0.10.2
safetensors==0.4.4
scikit-learn==1.5.1
scipy==1.14.0
seaborn==0.13.2
SecretStorage==3.3.3
sentry-sdk==2.13.0
setproctitle==1.3.3
six==1.16.0
smart-open==7.0.4
smashed==0.21.5
smmap==5.0.1
soupsieve==2.6
stack-data==0.6.3
sympy==1.13.1
threadpoolctl==3.5.0
tiktoken==0.7.0
tokenizers==0.19.1
tomli==2.0.1
torch==2.4.1
torchmetrics==1.4.1
torchtext==0.18.0
torchvision==0.19.1
tqdm==4.66.5
traitlets==5.14.3
transformers==4.44.0
triton==3.0.0
trouting==0.3.3
twine==5.1.1
types-setuptools==73.0.0.20240822
typing_extensions==4.12.2
tzdata==2024.1
urllib3==2.2.2
wandb==0.17.7
wcwidth==0.2.13
wrapt==1.16.0
xxhash==3.5.0
yarl==1.9.4
zipp==3.20.0