
Error Encountered During Multi-Node Pretraining with Torchrun

Open Zehui127 opened this issue 1 year ago • 0 comments

🐛 Describe the bug

Description:

We are conducting pretraining using our own data with the following torchrun command:

torchrun --nnodes=$NODES --nproc_per_node=$GPUS --node_rank=$RANK --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT /scripts/train.py /configs/official/Olmo-7B.yaml

The pretraining works as expected on a single node with multiple GPUs. However, when scaling to multiple nodes, we encounter the following error:

Traceback (most recent call last):
  File "/scratch/amlt_code/OLmo-GFM/scripts/train.py", line 263, in <module>
    main(cfg)
  File "/scratch/amlt_code/OLmo-GFM/scripts/train.py", line 109, in main
    train_loader = build_train_dataloader(cfg)
  File "/scratch/amlt_code/OLmo-GFM/olmo/data/__init__.py", line 99, in build_train_dataloader
    IterableDataset(
  File "/scratch/amlt_code/OLmo-GFM/olmo/data/iterable_dataset.py", line 70, in __init__
    self._build_and_save_global_indices()
  File "/scratch/amlt_code/OLmo-GFM/olmo/data/iterable_dataset.py", line 79, in _build_and_save_global_indices
    global_indices_mmap = np.memmap(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/numpy/core/memmap.py", line 230, in __new__
    with f_ctx as fid:
OSError: [Errno 5] Input/output error

Additional Details:

  • The error seems unrelated to GPU settings or data preparation, since the same configuration and data run correctly on a single node.
  • We suspect the storage backend: the checkpoint directory is on a mounted Azure Blob Storage container, and the traceback fails inside np.memmap while the global indices file is being created.
  • The error occurs only during multi-node execution, which points to a problem with how file I/O on the mounted storage behaves in the distributed setting.
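One way to test the storage hypothesis in isolation is to try creating a memory-mapped file directly in the suspect directory on each node, outside of training. The following is a minimal sketch and not part of OLMo: the function name `probe_memmap` and the probe file name are hypothetical, and it only assumes that the failing step is a writable `np.memmap` like the one in the traceback.

```python
import os
from pathlib import Path

import numpy as np


def probe_memmap(directory, n_items=1024):
    """Try to create, write, and re-read a small memmap in `directory`.

    Returns (ok, message). Hypothetical helper for diagnosis only.
    """
    path = Path(directory) / f"memmap_probe_{os.getpid()}.npy"
    expected = np.arange(n_items, dtype=np.uint32)
    try:
        Path(directory).mkdir(parents=True, exist_ok=True)
        # Writable, file-backed array: the same kind of call that fails in the traceback.
        arr = np.memmap(path, dtype=np.uint32, mode="w+", shape=(n_items,))
        arr[:] = expected
        arr.flush()
        del arr  # release the mapping before reopening
        readback = np.memmap(path, dtype=np.uint32, mode="r", shape=(n_items,))
        ok = bool(np.array_equal(readback, expected))
        del readback
        return ok, ("ok" if ok else "data mismatch on read-back")
    except OSError as exc:
        # An Errno 5 here would reproduce the failure without torchrun or OLMo.
        return False, f"{type(exc).__name__}: {exc}"
    finally:
        try:
            path.unlink(missing_ok=True)  # clean up the probe file if it was created
        except OSError:
            pass
```

Calling `probe_memmap(...)` with the mounted checkpoint directory and again with a node-local directory such as `/tmp` would show whether the mount itself rejects memory-mapped writes. If the probe succeeds from one node but fails when several nodes run it against the same mount at once, that would instead point to concurrent access to the shared directory.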

Request:

Has anyone on the team encountered a similar issue during development? Any insights or suggestions for troubleshooting this error would be greatly appreciated, in particular on whether a mounted blob store is compatible with the files written at startup during multi-node distributed training.

Versions

accelerate==0.34.2 -e git+https://github.com/Zehui127/OLmo-GFM.git@cd9edbb980a245aab29210d32a66e0e8b33ee4a5#egg=ai2_olmo aiohappyeyeballs==2.4.0 aiohttp==3.10.5 aiosignal==1.3.1 annotated-types==0.7.0 antlr4-python3-runtime==4.9.3 asttokens==2.4.1 async-timeout==4.0.3 attrs==24.2.0 backports.tarfile==1.2.0 beaker-gantry==1.8.3 beaker-py==1.31.2 beautifulsoup4==4.12.3 biopython==1.84 biotite==0.41.2 black==23.12.1 boltons==24.0.0 boto3==1.35.5 botocore==1.35.5 Brotli==1.1.0 build==1.2.1 cached_path==1.6.3 cachetools==5.5.0 certifi==2024.7.4 cffi==1.17.0 charset-normalizer==3.3.2 click==8.1.7 click-help-colors==0.9.4 cloudpathlib==0.18.1 contourpy==1.2.1 cryptography==43.0.0 cycler==0.12.1 datasets==2.21.0 decorator==5.1.1 dill==0.3.8 docker==7.1.0 docker-pycreds==0.4.0 docutils==0.21.2 einops==0.8.0 esm==3.0.2 evaluate==0.4.3 exceptiongroup==1.2.2 executing==2.0.1 face==20.1.1 filelock==3.13.4 fonttools==4.53.1 frozenlist==1.4.1 fsspec==2024.6.1 ftfy==6.2.3 gdown==5.2.0 gitdb==4.0.11 GitPython==3.1.43 glom==23.5.0 google-api-core==2.19.1 google-auth==2.34.0 google-cloud-core==2.4.1 google-cloud-storage==2.18.2 google-crc32c==1.5.0 google-resumable-media==2.7.2 googleapis-common-protos==1.63.2 huggingface-hub==0.23.5 idna==3.7 importlib_metadata==8.4.0 iniconfig==2.0.0 ipython==8.26.0 isort==5.12.0 jaraco.classes==3.4.0 jaraco.context==6.0.1 jaraco.functools==4.0.2 jedi==0.19.1 jeepney==0.8.0 Jinja2==3.1.4 jmespath==1.0.1 joblib==1.4.2 keyring==25.3.0 kiwisolver==1.4.5 lightning-utilities==0.11.6 markdown-it-py==3.0.0 MarkupSafe==2.1.5 matplotlib==3.9.2 matplotlib-inline==0.1.7 mdurl==0.1.2 more-itertools==10.4.0 mpmath==1.3.0 msgpack==1.0.8 msgpack-numpy==0.4.8 msgspec==0.18.6 multidict==6.0.5 multiprocess==0.70.16 mypy==1.3.0 mypy-extensions==1.0.0 necessary==0.4.3 networkx==3.3 nh3==0.2.18 numpy==1.26.4 nvidia-cublas-cu12==12.1.3.1 nvidia-cuda-cupti-cu12==12.1.105 nvidia-cuda-nvrtc-cu12==12.1.105 nvidia-cuda-runtime-cu12==12.1.105 nvidia-cudnn-cu12==9.1.0.70 
nvidia-cufft-cu12==11.0.2.54 nvidia-curand-cu12==10.3.2.106 nvidia-cusolver-cu12==11.4.5.107 nvidia-cusparse-cu12==12.1.0.106 nvidia-nccl-cu12==2.20.5 nvidia-nvjitlink-cu12==12.6.20 nvidia-nvtx-cu12==12.1.105 omegaconf==2.3.0 packaging==24.1 pandas==2.2.2 parso==0.8.4 pathspec==0.12.1 peft==0.12.0 petname==2.6 pexpect==4.9.0 pillow==10.4.0 pkginfo==1.10.0 platformdirs==4.2.2 pluggy==1.5.0 prompt_toolkit==3.0.47 proto-plus==1.24.0 protobuf==5.27.3 psutil==6.0.0 ptyprocess==0.7.0 pure_eval==0.2.3 pyarrow==17.0.0 pyasn1==0.6.0 pyasn1_modules==0.4.0 pycparser==2.22 pydantic==2.8.2 pydantic_core==2.20.1 Pygments==2.18.0 pyparsing==3.1.2 pyproject_hooks==1.1.0 PySocks==1.7.1 pytest==8.3.2 pytest-sphinx==0.6.3 python-dateutil==2.9.0.post0 pytorch-lightning==2.4.0 pytz==2024.1 PyYAML==6.0.2 readme_renderer==44.0 regex==2024.7.24 requests==2.32.3 requests-toolbelt==1.0.0 requirements-parser==0.11.0 rfc3986==2.0.0 rich==13.7.1 rsa==4.9 ruff==0.6.2 s3transfer==0.10.2 safetensors==0.4.4 scikit-learn==1.5.1 scipy==1.14.0 seaborn==0.13.2 SecretStorage==3.3.3 sentry-sdk==2.13.0 setproctitle==1.3.3 six==1.16.0 smart-open==7.0.4 smashed==0.21.5 smmap==5.0.1 soupsieve==2.6 stack-data==0.6.3 sympy==1.13.1 threadpoolctl==3.5.0 tiktoken==0.7.0 tokenizers==0.19.1 tomli==2.0.1 torch==2.4.1 torchmetrics==1.4.1 torchtext==0.18.0 torchvision==0.19.1 tqdm==4.66.5 traitlets==5.14.3 transformers==4.44.0 triton==3.0.0 trouting==0.3.3 twine==5.1.1 types-setuptools==73.0.0.20240822 typing_extensions==4.12.2 tzdata==2024.1 urllib3==2.2.2 wandb==0.17.7 wcwidth==0.2.13 wrapt==1.16.0 xxhash==3.5.0 yarl==1.9.4 zipp==3.20.0

Zehui127, Oct 21 '24 07:10