
Running run_dataloader.py is very slow

andrewivan123 opened this issue 5 months ago • 5 comments

🐛 Describe the bug

I wanted to get the ordered training data for the first n indices by running run_dataloader.py. However, I noticed that running it is very slow: I left it running overnight and it had only finished 44 batches. I would like to get the first 20k steps, which I expected to take around 2 days, since that is roughly the training time on 8 H100 cards once the dataset has been downloaded. Is there a problem with OLMo's data server? Is there a way to improve the speed?


Versions

Python 3.13.5 absl-py==2.3.1 accelerate==1.8.1 -e git+https://github.com/allenai/OLMo.git@f3dff833c880add075b123df9ddc31423086ef31#egg=ai2_olmo ai2-olmo-core==2.1.0 ai2-olmo-eval==0.7.1 aiohappyeyeballs==2.6.1 aiohttp==3.12.13 aiosignal==1.3.2 annotated-types==0.7.0 antlr4-python3-runtime==4.9.3 attrs==25.3.0 beaker-gantry==2.7.1 beaker-py==2.4.4 black==23.12.1 blessed==1.21.0 boltons==25.0.0 boto3==1.38.46 botocore==1.38.46 build==1.2.2.post1 cached_path==1.7.3 cachetools==5.5.2 certifi==2025.6.15 cffi==1.17.1 chardet==5.2.0 charset-normalizer==2.0.12 click==8.2.1 click-help-colors==0.9.4 click-option-group==0.5.7 colorama==0.4.6 cryptography==45.0.4 DataProperty==1.1.0 datasets==3.6.0 dill==0.3.8 docutils==0.21.2 einops==0.8.1 enlighten==1.10.1 evaluate==0.4.5 face==24.0.0 filelock==3.18.0 flash_attn==2.8.0.post2 frozenlist==1.7.0 fsspec==2025.3.0 ftfy==6.3.1 gitdb==4.0.12 GitPython==3.1.44 glom==24.11.0 google-api-core==2.25.1 google-auth==2.40.3 google-cloud-core==2.4.3 google-cloud-storage==2.19.0 google-crc32c==1.7.1 google-resumable-media==2.7.2 googleapis-common-protos==1.70.0 grpcio==1.73.1 hf-xet==1.1.5 huggingface-hub==0.33.1 id==1.5.0 idna==3.10 importlib_resources==6.5.2 iniconfig==2.1.0 isort==5.12.0 jaraco.classes==3.4.0 jaraco.context==6.0.1 jaraco.functools==4.2.1 jeepney==0.9.0 Jinja2==3.1.6 jmespath==1.0.1 joblib==1.5.1 jsonlines==4.0.0 keyring==25.6.0 latexcodec==3.0.1 lightning-utilities==0.14.3 -e git+https://github.com/EleutherAI/lm-evaluation-harness@fcddf195ec6bb69c63e36d54d75354f6ecaabab7#egg=lm_eval lxml==6.0.0 markdown-it-py==3.0.0 MarkupSafe==3.0.2 mbstrdecoder==1.1.4 mdurl==0.1.2 more-itertools==10.7.0 mpmath==1.3.0 msgspec==0.19.0 mtdata==0.4.0 multidict==6.6.2 multiprocess==0.70.16 mypy==1.3.0 mypy_extensions==1.1.0 necessary==0.4.3 networkx==3.5 nh3==0.2.21 nltk==3.9.1 numexpr==2.11.0 numpy==1.26.4 nvidia-cublas-cu12==12.6.4.1 nvidia-cuda-cupti-cu12==12.6.80 nvidia-cuda-nvrtc-cu12==12.6.77 nvidia-cuda-runtime-cu12==12.6.77 nvidia-cudnn-cu12==9.5.1.17 nvidia-cufft-cu12==11.3.0.4 nvidia-cufile-cu12==1.11.1.6 nvidia-curand-cu12==10.3.7.77 nvidia-cusolver-cu12==11.7.1.2 nvidia-cusparse-cu12==12.5.4.2 nvidia-cusparselt-cu12==0.6.3 nvidia-nccl-cu12==2.26.2 nvidia-nvjitlink-cu12==12.6.85 nvidia-nvtx-cu12==12.6.77 omegaconf==2.3.0 packaging==25.0 pandas==2.3.0 pathspec==0.12.1 pathvalidate==3.3.1 peft==0.16.0 petname==2.6 platformdirs==4.3.8 pluggy==1.6.0 portalocker==2.3.0 prefixed==0.9.0 propcache==0.3.2 proto-plus==1.26.1 protobuf==5.29.5 psutil==7.0.0 pyarrow==20.0.0 pyasn1==0.6.1 pyasn1_modules==0.4.2 pybind11==3.0.0 pybtex==0.24.0 pycparser==2.22 pydantic==2.11.7 pydantic_core==2.33.2 Pygments==2.19.2 pyproject_hooks==1.2.0 pytablewriter==1.2.1 pytest==8.4.1 pytest-sphinx==0.6.3 python-dateutil==2.9.0.post0 pytz==2025.2 PyYAML==6.0.2 readme_renderer==44.0 regex==2024.11.6 requests==2.32.4 requests-toolbelt==1.0.0 requirements-parser==0.13.0 rfc3986==2.0.0 rich==13.9.4 rouge_score==0.1.2 rsa==4.9.1 ruamel.yaml==0.18.14 ruamel.yaml.clib==0.2.12 ruff==0.12.1 s3transfer==0.13.0 sacrebleu==2.5.1 safetensors==0.5.3 scikit-learn==1.7.0 scipy==1.16.0 SecretStorage==3.3.3 sentencepiece @ file:///croot/sentencepiece-split_1742566759237/work/python sentry-sdk==2.32.0 setproctitle==1.3.6 setuptools==78.1.1 six==1.17.0 smart-open==7.1.0 smashed==0.21.5 smmap==5.0.2 sqlitedict==2.1.0 sympy==1.14.0 tabledata==1.3.4 tabulate==0.9.0 tcolorpy==0.1.7 threadpoolctl==3.6.0 tokenizers==0.21.2 torch==2.7.1 torchmetrics==1.7.3 tqdm==4.67.1 tqdm-multiprocess==0.0.11 -e 
git+https://github.com/huggingface/transformers.git@67ddc82fbc7e52c6f42a395b4a6d278c55b77a39#egg=transformers triton==3.3.1 trouting==0.3.3 twine==6.1.0 typepy==1.3.4 typing-inspection==0.4.1 typing_extensions==4.14.0 tzdata==2025.2 urllib3==1.26.20 wandb==0.20.1 wcwidth==0.2.13 wheel==0.45.1 wmtformat @ git+https://github.com/wmt-conference/wmt-format-tools.git@d46d4d75cf47095fbe7b15da29afd9348dfafeb1 word2number==1.1 wrapt==1.17.2 xxhash==3.5.0 yarl==1.20.1 zstandard==0.23.0

andrewivan123 avatar Jul 26 '25 01:07 andrewivan123

Hi there! While run_dataloader.py can get you the data order, it is mainly intended for debugging data loader issues and is not optimized for speed, though the run you describe does seem particularly slow. If you are just looking to train, this script is not necessary: the training order is automatically replicated when you launch the trainer with the same config. The script is also expected to run slower than actual training, since the training loop executes some of what happens in this script more efficiently.

The config you are passing in (OLMo2-1B-stage1.yaml) points to R2 URLs, which may not be reliable for a long-running job like this. If you still want to run this script to get the data order explicitly, you could first download the data from those R2 URLs locally, then update the commented-out code in run_dataloader.py to point at the local copies and uncomment it. Reading from local files should speed it up considerably.
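
As a rough illustration (not the maintainers' exact workflow), here is a minimal sketch of that download-and-rewrite step. It assumes the official config lists its training data as http(s) URLs under data.paths; the local directory and output config filename are made up for the example.

import os
from urllib.request import urlretrieve

import yaml

# Hypothetical paths for illustration only.
CONFIG_PATH = "configs/official-0425/OLMo2-1B-stage1.yaml"
LOCAL_DIR = "local_data"
LOCAL_CONFIG_PATH = "OLMo2-1B-stage1-local.yaml"

with open(CONFIG_PATH) as f:
    cfg = yaml.safe_load(f)

os.makedirs(LOCAL_DIR, exist_ok=True)
local_paths = []
for url in cfg["data"]["paths"]:  # assumes the data shards are listed under data.paths
    dest = os.path.join(LOCAL_DIR, os.path.basename(url))
    if not os.path.exists(dest):
        urlretrieve(url, dest)  # download each shard once; reruns reuse the local copy
    local_paths.append(dest)

# Point the config at the local copies instead of the R2 URLs and save a new config.
cfg["data"]["paths"] = local_paths
with open(LOCAL_CONFIG_PATH, "w") as f:
    yaml.dump(cfg, f)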

baileykuehl avatar Jul 31 '25 20:07 baileykuehl

I would like to train for only 10,000 steps, and I do not have the space to store all of the data behind the R2 URLs. Is there another way to get only the first n steps of the ordered training data without downloading the entire dataset?

andrewivan123 avatar Aug 01 '25 05:08 andrewivan123

If you only need the data order in order to train, you do not need to extract it explicitly or use this script. Just add the --stop_at parameter when training; as long as you keep the same config (which it looks like you are doing), training will reproduce the data order used for those steps.

For example:

torchrun --nproc_per_node=8 scripts/train.py configs/official-0425/OLMo2-1B-stage1.yaml \
  --stop_at=10000

baileykuehl avatar Aug 11 '25 00:08 baileykuehl

I would like to get the dataset for the first n steps itself, rather than train on it, because I want to retokenize it with a different tokenizer.

andrewivan123 avatar Aug 11 '25 03:08 andrewivan123

You should be able to use this script: https://github.com/allenai/OLMo/blob/main/scripts/inspect_train_data.py

This script gives you the global indices, which let you reproduce the training sequence without actually running the data loader.
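
In case it helps for the retokenization use case, here is a minimal sketch along the lines of the data-inspection snippet in the OLMo README. It assumes the TrainConfig.load / build_memmap_dataset API used there and a global_indices.npy data-order file for the run; the file and config paths below are placeholders to replace with your own.

import numpy as np

from olmo.config import TrainConfig
from olmo.data import build_memmap_dataset

# Placeholder paths: substitute the data-order file and config for your run.
data_order_file_path = "global_indices.npy"  # from the run's train_data directory
train_config_path = "configs/official-0425/OLMo2-1B-stage1.yaml"

cfg = TrainConfig.load(train_config_path)
dataset = build_memmap_dataset(cfg, cfg.data)
batch_size = cfg.global_train_batch_size
global_indices = np.memmap(data_order_file_path, mode="r", dtype=np.uint32)

def get_batch_instances(batch_idx: int) -> list[list[int]]:
    """Return the token IDs for every instance in the given training batch."""
    batch_start = batch_idx * batch_size
    batch_end = (batch_idx + 1) * batch_size
    batch_indices = global_indices[batch_start:batch_end]
    return [dataset[int(i)]["input_ids"].tolist() for i in batch_indices]

# e.g. iterate over the raw token IDs for the first 10,000 steps, ready to
# detokenize and re-tokenize with a different tokenizer.
first_batches = (get_batch_instances(i) for i in range(10_000))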

baileykuehl avatar Aug 12 '25 18:08 baileykuehl