Can't access training data in S3 storage
🐛 Describe the bug
Hi,
I'm trying to launch the local training of the "tiny" configuration (after a fresh GitHub clone and pip install):
torchrun scripts/train.py ./workspace/OLMo-20M/config.yaml --save_overwrite
However, I receive:
[olmo.util:623, rank=0] _s3_file_size failed attempt 2 with retriable error: An error occurred (403) when calling the HeadObject operation: Forbidden error
Then I try to list the bucket contents with aws s3 ls s3://ai2-llm/preprocessed/megawika/v1/gpt-neox-olmo-dolma-v1_5/ --no-sign-request and get:
An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied
Is the data public?
Screenshot from https://us-east-1.console.aws.amazon.com/s3/buckets/ai2-llm?region=us-east-1&bucketType=general&tab=objects
Or am I missing something?
Versions
(.venv) nickolay@leblanc:~/workspace/python/OLMo$ python --version && pip freeze Python 3.12.3 -e git+ssh://[email protected]/allenai/OLMo.git@04820704616af5d25cdba4df23aa7b4d9ce86cad#egg=ai2_olmo ai2-olmo-core==2.1.0 ai2-olmo-eval==0.7.1 aiohappyeyeballs==2.6.1 aiohttp==3.12.15 aiosignal==1.4.0 annotated-types==0.7.0 antlr4-python3-runtime==4.9.3 attrs==25.3.0 beaker-gantry==3.0.0 beaker-py==2.4.7 black==23.12.1 boltons==25.0.0 boto3==1.40.13 botocore==1.40.13 build==1.3.0 cached_path==1.7.3 cachetools==5.5.2 certifi==2025.8.3 cffi==1.17.1 charset-normalizer==3.4.3 click==8.2.1 click-help-colors==0.9.4 click-option-group==0.5.7 cryptography==45.0.6 datasets==4.0.0 dill==0.3.8 docutils==0.22 face==24.0.0 filelock==3.19.1 frozenlist==1.7.0 fsspec==2025.3.0 ftfy==6.3.1 gitdb==4.0.12 GitPython==3.1.45 glom==24.11.0 google-api-core==2.25.1 google-auth==2.40.3 google-cloud-core==2.4.3 google-cloud-storage==2.19.0 google-crc32c==1.7.1 google-resumable-media==2.7.2 googleapis-common-protos==1.70.0 grpcio==1.74.0 hf-xet==1.1.8 huggingface-hub==0.34.4 id==1.5.0 idna==3.10 importlib_resources==6.5.2 iniconfig==2.1.0 isort==5.12.0 jaraco.classes==3.4.0 jaraco.context==6.0.1 jaraco.functools==4.3.0 jeepney==0.9.0 Jinja2==3.1.6 jmespath==1.0.1 joblib==1.5.1 keyring==25.6.0 lightning-utilities==0.15.2 markdown-it-py==4.0.0 MarkupSafe==3.0.2 mdurl==0.1.2 more-itertools==10.7.0 mpmath==1.3.0 msgspec==0.19.0 multidict==6.6.4 multiprocess==0.70.16 mypy==1.3.0 mypy_extensions==1.1.0 necessary==0.4.3 networkx==3.5 nh3==0.3.0 numpy==1.26.4 nvidia-cublas-cu12==12.8.4.1 nvidia-cuda-cupti-cu12==12.8.90 nvidia-cuda-nvrtc-cu12==12.8.93 nvidia-cuda-runtime-cu12==12.8.90 nvidia-cudnn-cu12==9.10.2.21 nvidia-cufft-cu12==11.3.3.83 nvidia-cufile-cu12==1.13.1.3 nvidia-curand-cu12==10.3.9.90 nvidia-cusolver-cu12==11.7.3.90 nvidia-cusparse-cu12==12.5.8.93 nvidia-cusparselt-cu12==0.7.1 nvidia-nccl-cu12==2.27.3 nvidia-nvjitlink-cu12==12.8.93 nvidia-nvtx-cu12==12.8.90 omegaconf==2.3.0 packaging==25.0 pandas==2.3.1 pathspec==0.12.1 petname==2.6 platformdirs==4.3.8 pluggy==1.6.0 propcache==0.3.2 proto-plus==1.26.1 protobuf==5.29.5 pyarrow==21.0.0 pyasn1==0.6.1 pyasn1_modules==0.4.2 pycparser==2.22 pydantic==2.11.7 pydantic_core==2.33.2 Pygments==2.19.2 pyproject_hooks==1.2.0 pytest==8.4.1 pytest-sphinx==0.6.3 python-dateutil==2.9.0.post0 pytz==2025.2 PyYAML==6.0.2 readme_renderer==44.0 regex==2025.7.34 requests==2.32.5 requests-toolbelt==1.0.0 requirements-parser==0.13.0 rfc3986==2.0.0 rich==13.9.4 rsa==4.9.1 ruff==0.12.9 s3transfer==0.13.1 safetensors==0.6.2 scikit-learn==1.7.1 scipy==1.16.1 SecretStorage==3.3.3 sentry-sdk==2.35.0 setuptools==80.9.0 six==1.17.0 smart_open==7.3.0.post1 smashed==0.21.5 smmap==5.0.2 sympy==1.14.0 threadpoolctl==3.6.0 tokenizers==0.21.4 torch==2.8.0 torchmetrics==1.8.1 tqdm==4.67.1 transformers==4.55.2 triton==3.4.0 trouting==0.3.3 twine==6.1.0 typing-inspection==0.4.1 typing_extensions==4.14.1 tzdata==2025.2 urllib3==2.5.0 wandb==0.21.1 wcwidth==0.2.13 wheel==0.45.1 wrapt==1.17.3 xxhash==3.5.0 yarl==1.20.1
Hi there, thanks for pointing this out! The way the configs are set up right now, the URLs point to our private S3 bucket, while the public data is hosted in our public Cloudflare R2 storage. I will put out a PR to update the URLs to the public ones. In the meantime, if you need to run this ASAP, you can simply replace the private S3 prefix s3://ai2-llm with the public prefix https://olmo-data.org.
For example, this URL would go from:
s3://ai2-llm/eval-data/perplexity/v3_small_gptneox20b/c4_en/val/part-0-00000.npy
to
https://olmo-data.org/eval-data/perplexity/v3_small_gptneox20b/c4_en/val/part-0-00000.npy
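If you'd rather not edit the config by hand, a quick one-off rewrite of a local copy could look like this (a minimal sketch, assuming the config lives at workspace/OLMo-20M/config.yaml; adjust the path to your setup):

```python
# Minimal sketch: swap the private S3 prefix for the public R2 one in a local config copy.
from pathlib import Path

config_path = Path("workspace/OLMo-20M/config.yaml")  # adjust to your config location
text = config_path.read_text()
config_path.write_text(text.replace("s3://ai2-llm", "https://olmo-data.org"))
```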
Thank you, that got me further, but now I hit the following error; see the output below.
I also noticed that when downloading files manually through the browser, the download speed is initially good (around 50-100 Mb/s):
but then it drops to around 100 Kb/s
and the download takes forever to complete. Any idea what it could be? Perhaps it's related to the error below?
(.venv) nickolay@leblanc:~/workspace/python/OLMo$ torchrun scripts/train.py ./workspace/OLMo-20M/config.yaml --save_overwrite
[2025-08-22 08:51:24] INFO [train:417, rank=0] CLI environment prepared
[2025-08-22 08:51:24] INFO [train:105, rank=0] Saving config to workspace/OLMo-20M/config.yaml
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.21.1
wandb: W&B syncing is set to `offline` in this directory. Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
wandb: Run data is saved locally in workspace/OLMo-20M/wandb/wandb/offline-run-20250822_085126-o3zhrecp
[2025-08-22 08:51:26] INFO [botocore.credentials:1356, rank=0] Found credentials in shared credentials file: ~/.aws/credentials
Traceback (most recent call last):
File "/home/nickolay/workspace/python/OLMo/scripts/train.py", line 436, in <module>
main(cfg)
File "/home/nickolay/workspace/python/OLMo/scripts/train.py", line 132, in main
train_loader = build_train_dataloader(cfg)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nickolay/workspace/python/OLMo/olmo/data/__init__.py", line 156, in build_train_dataloader
dataset = IterableDataset(
^^^^^^^^^^^^^^^^
File "/home/nickolay/workspace/python/OLMo/olmo/data/iterable_dataset.py", line 57, in __init__
if self.drop_last and len(self.dataset) % self.world_size != 0: # type: ignore[arg-type]
^^^^^^^^^^^^^^^^^
File "/home/nickolay/workspace/python/OLMo/olmo/data/memmap_dataset.py", line 176, in __len__
self._num_instances = self.offsets[-1][1]
^^^^^^^^^^^^
File "/home/nickolay/workspace/python/OLMo/olmo/data/memmap_dataset.py", line 138, in offsets
path, length = future.result()
^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nickolay/workspace/python/OLMo/olmo/data/memmap_dataset.py", line 172, in _get_file_length
return path, file_size(path) // (item_size * self._chunk_size)
^^^^^^^^^^^^^^^
File "/home/nickolay/workspace/python/OLMo/olmo/util.py", line 344, in file_size
return _http_file_size(parsed.scheme, parsed.netloc, parsed.path.strip("/"))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nickolay/workspace/python/OLMo/olmo/util.py", line 701, in _http_file_size
return int(response.headers.get("content-length"))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'
[2025-08-22 08:53:57] CRITICAL [olmo.util:168, rank=0] Uncaught TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/nickolay/workspace/python/OLMo/scripts/train.py:436 in <module> │
│ │
│ 433 │ │ log.info("Device is CPU. Updating config...") │
│ 434 │ │ cfg.model.init_device = "cpu" │
│ 435 │ │ cfg.distributed_strategy = "single" # type: ignore │
│ ❱ 436 │ main(cfg) │
│ 437 │
│ │
│ /home/nickolay/workspace/python/OLMo/scripts/train.py:132 in main │
│ │
│ 129 │ seed_all(cfg.seed) │
│ 130 │ │
│ 131 │ # Construct data loader. │
│ ❱ 132 │ train_loader = build_train_dataloader(cfg) │
│ 133 │ │
│ 134 │ # Construct evaluators. │
│ 135 │ evaluators = build_evaluators(cfg, device) │
│ │
│ /home/nickolay/workspace/python/OLMo/olmo/data/__init__.py:156 in build_train_dataloader │
│ │
│ 153 │ │ │ ) │
│ 154 │ │ else: │
│ 155 │ │ │ work_dir.mkdir(exist_ok=True, parents=True) │
│ ❱ 156 │ dataset = IterableDataset( │
│ 157 │ │ dataset, # type: ignore │
│ 158 │ │ train_config.global_train_batch_size, │
│ 159 │ │ seed=seed, │
│ │
│ /home/nickolay/workspace/python/OLMo/olmo/data/iterable_dataset.py:57 in __init__ │
│ │
│ 54 │ │ self.world_size = world_size if world_size is not None else get_world_size() │
│ 55 │ │ # If the dataset length is evenly divisible by # of replicas, then there │
│ 56 │ │ # is no need to drop any data, since the dataset will be split equally. │
│ ❱ 57 │ │ if self.drop_last and len(self.dataset) % self.world_size != 0: # type: ignore[ar │
│ 58 │ │ │ # Split to nearest available length that is evenly divisible by world size. │
│ 59 │ │ │ # This is to ensure each rank receives the same amount of data. │
│ 60 │ │ │ num_samples = math.ceil( │
│ │
│ /home/nickolay/workspace/python/OLMo/olmo/data/memmap_dataset.py:176 in __len__ │
│ │
│ 173 │ │
│ 174 │ def __len__(self) -> int: │
│ 175 │ │ if self._num_instances is None: │
│ ❱ 176 │ │ │ self._num_instances = self.offsets[-1][1] │
│ 177 │ │ return self._num_instances │
│ 178 │ │
│ 179 │ def __getitem__(self, index: int) -> Dict[str, Any]: │
│ │
│ /home/nickolay/workspace/python/OLMo/olmo/data/memmap_dataset.py:138 in offsets │
│ │
│ 135 │ │ │ │ │ │ mask_path_futures.append(executor.submit(self._get_file_length, ma │
│ 136 │ │ │ │ │
│ 137 │ │ │ │ for future in concurrent.futures.as_completed(path_futures): │
│ ❱ 138 │ │ │ │ │ path, length = future.result() │
│ 139 │ │ │ │ │ path_to_length[path] = length │
│ 140 │ │ │ │ │
│ 141 │ │ │ │ for future in concurrent.futures.as_completed(mask_path_futures): │
│ │
│ /usr/lib/python3.12/concurrent/futures/_base.py:449 in result │
│ │
│ 446 │ │ │ │ if self._state in [CANCELLED, CANCELLED_AND_NOTIFIED]: │
│ 447 │ │ │ │ │ raise CancelledError() │
│ 448 │ │ │ │ elif self._state == FINISHED: │
│ ❱ 449 │ │ │ │ │ return self.__get_result() │
│ 450 │ │ │ │ │
│ 451 │ │ │ │ self._condition.wait(timeout) │
│ 452 │
│ │
│ /usr/lib/python3.12/concurrent/futures/_base.py:401 in __get_result │
│ │
│ 398 │ def __get_result(self): │
│ 399 │ │ if self._exception: │
│ 400 │ │ │ try: │
│ ❱ 401 │ │ │ │ raise self._exception │
│ 402 │ │ │ finally: │
│ 403 │ │ │ │ # Break a reference cycle with the exception in self._exception │
│ 404 │ │ │ │ self = None │
│ │
│ /usr/lib/python3.12/concurrent/futures/thread.py:58 in run │
│ │
│ 55 │ │ │ return │
│ 56 │ │ │
│ 57 │ │ try: │
│ ❱ 58 │ │ │ result = self.fn(*self.args, **self.kwargs) │
│ 59 │ │ except BaseException as exc: │
│ 60 │ │ │ self.future.set_exception(exc) │
│ 61 │ │ │ # Break a reference cycle with the exception 'exc' │
│ │
│ /home/nickolay/workspace/python/OLMo/olmo/data/memmap_dataset.py:172 in _get_file_length │
│ │
│ 169 │ def _get_file_length(self, path, dtype=None) -> Tuple[PathOrStr, int]: │
│ 170 │ │ dtype = dtype or self.dtype │
│ 171 │ │ item_size = dtype(0).itemsize │
│ ❱ 172 │ │ return path, file_size(path) // (item_size * self._chunk_size) │
│ 173 │ │
│ 174 │ def __len__(self) -> int: │
│ 175 │ │ if self._num_instances is None: │
│ │
│ /home/nickolay/workspace/python/OLMo/olmo/util.py:344 in file_size │
│ │
│ 341 │ │ elif parsed.scheme in ("s3", "r2", "weka"): │
│ 342 │ │ │ return _s3_file_size(parsed.scheme, parsed.netloc, parsed.path.strip("/")) │
│ 343 │ │ elif parsed.scheme in ("http", "https"): │
│ ❱ 344 │ │ │ return _http_file_size(parsed.scheme, parsed.netloc, parsed.path.strip("/")) │
│ 345 │ │ elif parsed.scheme == "file": │
│ 346 │ │ │ return file_size(str(path).replace("file://", "", 1)) │
│ 347 │ │ else: │
│ │
│ /home/nickolay/workspace/python/OLMo/olmo/util.py:701 in _http_file_size │
│ │
│ 698 │ import requests │
│ 699 │ │
│ 700 │ response = requests.head(f"{scheme}://{host_name}/{path}", allow_redirects=True) │
│ ❱ 701 │ return int(response.headers.get("content-length")) │
│ 702 │
│ 703 │
│ 704 def _http_get_bytes_range(scheme: str, host_name: str, path: str, bytes_start: int, num_by │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync workspace/OLMo-20M/wandb/wandb/offline-run-20250822_085126-o3zhrecp
[rank0]:[W822 08:53:58.903613918 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E0822 08:53:59.657000 80465 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 80490) of binary: /home/nickolay/workspace/python/OLMo/.venv/bin/python
Traceback (most recent call last):
File "/home/nickolay/workspace/python/OLMo/.venv/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/nickolay/workspace/python/OLMo/.venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nickolay/workspace/python/OLMo/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/home/nickolay/workspace/python/OLMo/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/nickolay/workspace/python/OLMo/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 143, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nickolay/workspace/python/OLMo/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-08-22_08:53:59
host : leblanc
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 80490)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Any idea what it could be? Does torchrun scripts/train.py ./workspace/OLMo-20M/config.yaml --save_overwrite work for you (after replacing the s3:// URLs with https:// ones)?
Yes - I tried a few things and I believe the issue is that we fixed up some of the files in one of the directories and renamed it, so the path to that directory in the current config no longer exists.
The current config for the RedPajama Stack Exchange data contains paths like:
../preprocessed/redpajama_stackexchange_only/v1_decontaminated/gpt-neox-olmo-dolma-v1_5/part-00-00000.npy
and looking at R2, it looks like we went back, updated those files, and changed the paths to:
../preprocessed/redpajama_v1_decon_fix/stackexchange/gpt-neox-olmo-dolma-v1_5/part-00-00000.npy
so requests to the stale links return no Content-Length header, which comes back as None in _http_file_size. Hopefully this swap fixes it. I added the fixed file paths to my branch, which you can use until it's merged: https://github.com/allenai/OLMo/blob/bk-fix-olmo-tiny/configs/tiny/OLMo-20M.yaml
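If you want to double-check a config before launching, a rough sanity check along these lines (a hypothetical helper, not part of the repo) will flag any http(s) data path whose HEAD request comes back without a Content-Length:

```python
# Hypothetical helper (not in the OLMo repo): HEAD every http(s) .npy path referenced
# in a config and report any that are missing a Content-Length header, which is
# exactly the condition that trips up _http_file_size.
import re
import requests

with open("configs/tiny/OLMo-20M.yaml") as f:  # adjust to your config path
    urls = sorted(set(re.findall(r"https?://\S+\.npy", f.read())))

for url in urls:
    resp = requests.head(url, allow_redirects=True)
    if resp.status_code != 200 or resp.headers.get("content-length") is None:
        print(f"BROKEN: {url} (status {resp.status_code})")
```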
Generally, download speed from R2 is going to be a bit slow. If you want to speed things up, you could download this data locally and then change the paths in the config to wherever your local copies of those files live.
Thank you, I will try that. It looks like test coverage here is somewhat limited.
Still does not work (same error); I guess a link to some other file is broken. Is it possible to download the "raw" unprocessed data somewhere and then run the pre-processing pipeline on it?
Sorry about that - it looks like a botched path got overwritten while I was fixing the URLs. I launched a run on my branch with the updated config (botched path fixed) using:
torchrun --nproc_per_node=1 scripts/train.py configs/tiny/OLMo-20M.yaml
and confirmed that the run kicked off and ran successfully for the first 1000 steps. Can you try again with the fixed config? Unfortunately, we don't have the raw, untokenized data for these tiny models uploaded publicly.
Also - the config contains a remote_save_folder, which currently points to a private S3 bucket, so you may want to update that as well.
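For a purely local run you can point remote_save_folder at storage you control, or clear it entirely (it appears to be an optional field), so that no upload to that bucket is attempted.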
Thank you, I finally managed to launch the training (with some patching)!
It trained for 5000 steps, saved a checkpoint, and then failed with an OOM error (weird; perhaps something is left over in memory after the saving/uploading process?). ~~How do I resume from the latest checkpoint?~~ Never mind, I found the try_load_latest_save config option.
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.28 GiB. GPU 0 has a total capacity of 23.56 GiB of which 3.72 GiB is free. Including non-PyTorch memory, this process has 19.82 GiB memory in use. Of the allocated memory 18.99 GiB is allocated by PyTorch, and 70.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
So it seems that after saving a checkpoint and starting a new cycle, the memory from the old cycle is not freed.
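For what it's worth, resuming would presumably look something like the following (assuming try_load_latest_save can be passed as a flag the same way --save_overwrite is, plus the allocator setting the error message suggests):
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --nproc_per_node=1 scripts/train.py configs/tiny/OLMo-20M.yaml --try_load_latest_save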