OLMoThreadError: generator thread data thread 0 failed
❓ The question
I use the default config configs/official/OLMo-1B.yaml with the wandb section removed, train on 8x A800 GPUs, and launch training with torchrun --nproc_per_node=8 scripts/train.py configs/official/OLMo-1B.yaml.
Training starts without any errors, but after a few steps (roughly 10 minutes in) the following error appears:
[2024-08-18 12:26:53] INFO [olmo.train:966, rank=0] [step=1/739328,epoch=0]
optim/total_grad_norm=9.355
train/CrossEntropyLoss=11.35
train/Perplexity=84,618
throughput/total_tokens=4,194,304
throughput/total_training_Gflops=6,215,264
throughput/total_training_log_Gflops=15.64
System/Peak GPU Memory (MB)=42,083
[2024-08-18 12:28:06] INFO [olmo.train:966, rank=0] [step=2/739328,epoch=0]
optim/total_grad_norm=59.20
train/CrossEntropyLoss=10.57
train/Perplexity=38,880
throughput/total_tokens=8,388,608
throughput/total_training_Gflops=12,430,528
throughput/total_training_log_Gflops=16.34
throughput/device/tokens_per_second=20,592
throughput/device/batches_per_second=0.0393
System/Peak GPU Memory (MB)=43,259
[2024-08-18 12:29:21] INFO [olmo.train:966, rank=0] [step=3/739328,epoch=0]
optim/total_grad_norm=28.37
train/CrossEntropyLoss=10.70
train/Perplexity=44,302
throughput/total_tokens=12,582,912
throughput/total_training_Gflops=18,645,793
throughput/total_training_log_Gflops=16.74
throughput/device/tokens_per_second=10,506
throughput/device/batches_per_second=0.0200
[2024-08-18 12:30:37] INFO [olmo.train:966, rank=0] [step=4/739328,epoch=0]
optim/total_grad_norm=9.530
train/CrossEntropyLoss=11.06
train/Perplexity=63,518
throughput/total_tokens=16,777,216
throughput/total_training_Gflops=24,861,057
throughput/total_training_log_Gflops=17.03
throughput/device/tokens_per_second=8,949
throughput/device/batches_per_second=0.0171
[2024-08-18 12:31:54] INFO [olmo.train:966, rank=0] [step=5/739328,epoch=0]
optim/total_grad_norm=17.78
train/CrossEntropyLoss=10.69
train/Perplexity=44,126
throughput/total_tokens=20,971,520
throughput/total_training_Gflops=31,076,321
throughput/total_training_log_Gflops=17.25
throughput/device/tokens_per_second=8,271
throughput/device/batches_per_second=0.0158
[2024-08-18 12:33:13] INFO [olmo.train:966, rank=0] [step=6/739328,epoch=0]
optim/total_grad_norm=7.477
train/CrossEntropyLoss=10.23
train/Perplexity=27,666
throughput/total_tokens=25,165,824
throughput/total_training_Gflops=37,291,586
throughput/total_training_log_Gflops=17.43
throughput/device/tokens_per_second=7,890
throughput/device/batches_per_second=0.0151
[2024-08-18 12:33:54] CRITICAL [olmo.util:163, rank=5] Uncaught OLMoThreadError: generator thread data thread 0 failed
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/aaabbb/projects/OLMo/olmo/util.py:709 in fill_queue │
│ │
│ 706 │ │
│ 707 │ def fill_queue(): │
│ 708 │ │ try: │
│ ❱ 709 │ │ │ for value in g: │
│ 710 │ │ │ │ q.put(value) │
│ 711 │ │ except Exception as e: │
│ 712 │ │ │ q.put(e) │
│ │
│ /data/aaabbb/projects/OLMo/olmo/data/iterable_dataset.py:174 in <genexpr> │
│ │
│ 171 │ │ │ │
│ 172 │ │ │ thread_generators = [] │
│ 173 │ │ │ for i in range(num_threads): │
│ ❱ 174 │ │ │ │ generator = (self._get_dataset_item(int(idx)) for idx in indices[i::num_th │
│ 175 │ │ │ │ thread_generators.append( │
│ 176 │ │ │ │ │ threaded_generator(generator, maxsize=queue_size, thread_name=f"data t │
│ 177 │ │ │ │ ) │
│ │
│ /data/aaabbb/projects/OLMo/olmo/data/iterable_dataset.py:184 in _get_dataset_item │
│ │
│ 181 │ │ │ return (self._get_dataset_item(int(idx)) for idx in indices) │
│ 182 │ │
│ 183 │ def _get_dataset_item(self, idx: int) -> Dict[str, Any]: │
│ ❱ 184 │ │ item = self.dataset[idx] │
│ 185 │ │ if isinstance(item, dict): │
│ 186 │ │ │ return dict(**item, index=idx) │
│ 187 │ │ else: │
│ │
│ /data/aaabbb/projects/OLMo/olmo/data/memmap_dataset.py:196 in __getitem__ │
│ │
│ 193 │ │ │ raise IndexError(f"{index} is out of bounds for dataset of size {len(self)}") │
│ 194 │ │ │
│ 195 │ │ # Read the data from file. │
│ ❱ 196 │ │ input_ids = self._read_chunk_from_memmap(self._memmap_paths[memmap_index], memmap_ │
│ 197 │ │ out: Dict[str, Any] = {"input_ids": input_ids} │
│ 198 │ │ if self.instance_filter_config is not None: │
│ 199 │ │ │ out["instance_mask"] = self._validate_instance(input_ids) │
│ │
│ /data/aaabbb/projects/OLMo/olmo/data/memmap_dataset.py:162 in _read_chunk_from_memmap │
│ │
│ 159 │ │ item_size = dtype(0).itemsize │
│ 160 │ │ bytes_start = index * item_size * self._chunk_size │
│ 161 │ │ num_bytes = item_size * self._chunk_size │
│ ❱ 162 │ │ buffer = get_bytes_range(path, bytes_start, num_bytes) │
│ 163 │ │ array = np.frombuffer(buffer, dtype=dtype) │
│ 164 │ │ if dtype == np.bool_: │
│ 165 │ │ │ return torch.tensor(array) │
│ │
│ /data/aaabbb/projects/OLMo/olmo/util.py:375 in get_bytes_range │
│ │
│ 372 │ │ │ │ parsed.scheme, parsed.netloc, parsed.path.strip("/"), bytes_start, num_byt │
│ 373 │ │ │ ) │
│ 374 │ │ elif parsed.scheme in ("http", "https"): │
│ ❱ 375 │ │ │ return _http_get_bytes_range( │
│ 376 │ │ │ │ parsed.scheme, parsed.netloc, parsed.path.strip("/"), bytes_start, num_byt │
│ 377 │ │ │ ) │
│ 378 │ │ elif parsed.scheme == "file": │
│ │
│ /data/aaabbb/projects/OLMo/olmo/util.py:649 in _http_get_bytes_range │
│ │
│ 646 │ ) │
│ 647 │ result = response.content │
│ 648 │ assert ( │
│ ❱ 649 │ │ len(result) == num_bytes │
│ 650 │ ), f"expected {num_bytes} bytes, got {len(result)}" # Some web servers silently ignor │
│ 651 │ return result │
│ 652 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AssertionError: expected 4096 bytes, got 175
The above exception was the direct cause of the following exception:
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/aaabbb/projects/OLMo/scripts/train.py:345 in <module> │
│ │
│ 342 │ │ raise OLMoCliError(f"Usage: {sys.argv[0]} [CONFIG_PATH] [OPTIONS]") │
│ 343 │ │
│ 344 │ cfg = TrainConfig.load(yaml_path, [clean_opt(s) for s in args_list]) │
│ ❱ 345 │ main(cfg) │
│ 346 │
│ │
│ /data/aaabbb/projects/OLMo/scripts/train.py:317 in main │
│ │
│ 314 │ │ │
│ 315 │ │ if not cfg.dry_run: │
│ 316 │ │ │ log.info("Starting training...") │
│ ❱ 317 │ │ │ trainer.fit() │
│ 318 │ │ │ log.info("Training complete") │
│ 319 │ │ else: │
│ 320 │ │ │ log.info("Dry run complete") │
│ │
│ /data/aaabbb/projects/OLMo/olmo/train.py:1181 in fit │
│ │
│ 1178 │ │ │
│ 1179 │ │ with torch_profiler as p: │
│ 1180 │ │ │ for epoch in range(self.epoch or 0, self.max_epochs): │
│ ❱ 1181 │ │ │ │ for batch in self.train_loader: │
│ 1182 │ │ │ │ │ # Bookkeeping. │
│ 1183 │ │ │ │ │ # NOTE: To track the global batch size / number of tokens per batch w │
│ 1184 │ │ │ │ │ # batches see the same number of tokens, which should be the case for │
│ │
│ /home/tqgpt/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/utils/data/dataload │
│ │
│ 628 │ │ │ if self._sampler_iter is None: │
│ 629 │ │ │ │ # TODO(https://github.com/pytorch/pytorch/issues/76750) │
│ 630 │ │ │ │ self._reset() # type: ignore[call-arg] │
│ ❱ 631 │ │ │ data = self._next_data() │
│ 632 │ │ │ self._num_yielded += 1 │
│ 633 │ │ │ if self._dataset_kind == _DatasetKind.Iterable and \ │
│ 634 │ │ │ │ │ self._IterableDataset_len_called is not None and \ │
│ │
│ /home/tqgpt/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/utils/data/dataload │
│ │
│ 672 │ │
│ 673 │ def _next_data(self): │
│ 674 │ │ index = self._next_index() # may raise StopIteration │
│ ❱ 675 │ │ data = self._dataset_fetcher.fetch(index) # may raise StopIteration │
│ 676 │ │ if self._pin_memory: │
│ 677 │ │ │ data = _utils.pin_memory.pin_memory(data, self._pin_memory_device) │
│ 678 │ │ return data │
│ │
│ /home/tqgpt/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/utils/data/_utils/f │
│ │
│ 29 │ │ │ data = [] │
│ 30 │ │ │ for _ in possibly_batched_index: │
│ 31 │ │ │ │ try: │
│ ❱ 32 │ │ │ │ │ data.append(next(self.dataset_iter)) │
│ 33 │ │ │ │ except StopIteration: │
│ 34 │ │ │ │ │ self.ended = True │
│ 35 │ │ │ │ │ break │
│ │
│ /data/aaabbb/projects/OLMo/olmo/data/iterable_dataset.py:179 in <genexpr> │
│ │
│ 176 │ │ │ │ │ threaded_generator(generator, maxsize=queue_size, thread_name=f"data t │
│ 177 │ │ │ │ ) │
│ 178 │ │ │ │
│ ❱ 179 │ │ │ return (x for x in roundrobin(*thread_generators)) │
│ 180 │ │ else: │
│ 181 │ │ │ return (self._get_dataset_item(int(idx)) for idx in indices) │
│ 182 │
│ │
│ /data/aaabbb/projects/OLMo/olmo/util.py:738 in roundrobin │
│ │
│ 735 │ while num_active: │
│ 736 │ │ try: │
│ 737 │ │ │ for next in nexts: │
│ ❱ 738 │ │ │ │ yield next() │
│ 739 │ │ except StopIteration: │
│ 740 │ │ │ # Remove the iterator we just exhausted from the cycle. │
│ 741 │ │ │ num_active -= 1 │
│ │
│ /data/aaabbb/projects/OLMo/olmo/util.py:722 in threaded_generator │
│ │
│ 719 │ │
│ 720 │ for x in iter(q.get, sentinel): │
│ 721 │ │ if isinstance(x, Exception): │
│ ❱ 722 │ │ │ raise OLMoThreadError(f"generator thread {thread_name} failed") from x │
│ 723 │ │ else: │
│ 724 │ │ │ yield x │
│ 725 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OLMoThreadError: generator thread data thread 0 failed
W0818 12:33:57.741000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227462 closing signal SIGTERM
W0818 12:33:57.741000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227463 closing signal SIGTERM
W0818 12:33:57.742000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227464 closing signal SIGTERM
W0818 12:33:57.742000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227465 closing signal SIGTERM
W0818 12:33:57.743000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227466 closing signal SIGTERM
W0818 12:33:57.743000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227468 closing signal SIGTERM
W0818 12:33:57.744000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227469 closing signal SIGTERM
/home/tqgpt/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
E0818 12:33:59.086000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 5 (pid: 1227467) of binary: /home/cde/anaconda3/envs/env_olmo_py311/bin/python
Traceback (most recent call last):
File "/home/cde/anaconda3/envs/env_olmo_py311/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-08-18_12:33:57
host : ubuntu
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 1227467)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
(env_olmo_py311) cde@ubuntu:/data/aaabbb/projects/OLMo$
same issue here
The issue is most likely caused by reading the dataset over the network: the data paths in the config are HTTP(S) URLs, so the loader fetches each chunk with a ranged HTTP request during training. Under heavy traffic, connection problems, or server-side rate limiting (e.g. exceeding a request limit), the server can return a short error body (here about 175 bytes) instead of the requested 4096-byte chunk, which trips the length assertion and crashes training after a while.
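For illustration, here is a minimal sketch of the kind of ranged read the loader performs (compare _http_get_bytes_range in the traceback above); the URL and helper below are hypothetical, not the repo's actual code:

```python
# Minimal sketch of a ranged HTTP read like the one in the traceback.
# The URL and this helper are hypothetical, for illustration only.
import requests

def get_bytes_range(url: str, bytes_start: int, num_bytes: int) -> bytes:
    # Request exactly the bytes [bytes_start, bytes_start + num_bytes - 1].
    response = requests.get(
        url,
        headers={"Range": f"bytes={bytes_start}-{bytes_start + num_bytes - 1}"},
        timeout=30,
    )
    result = response.content
    # A rate-limited or failing server may answer with a short error body
    # (~175 bytes here) instead of the requested chunk, which is exactly
    # what the "expected 4096 bytes, got 175" assertion catches.
    assert len(result) == num_bytes, f"expected {num_bytes} bytes, got {len(result)}"
    return result

# e.g. one 2048-token uint16 chunk is 4096 bytes:
# chunk = get_bytes_range("https://example.org/data/part-000.npy", 0, 4096)
```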
To resolve this, download the dataset to local storage and point the config at the local files. That takes the network out of the data path and avoids these intermittent failures during training.
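As a rough sketch (not part of the repo), one could pre-download the shards referenced in the config and rewrite the paths to local copies. This assumes the token files are listed under data.paths in the YAML, as in the official configs; the local_data directory and output config name are made up:

```python
# Rough sketch: download the remote token shards listed under data.paths
# in the official config and rewrite the paths so the memmap dataset reads
# from local disk instead of over HTTPS. "local_data" and the output config
# name are illustrative.
import os
import urllib.request

import yaml  # PyYAML

CONFIG = "configs/official/OLMo-1B.yaml"
LOCAL_DIR = "local_data"

with open(CONFIG) as f:
    cfg = yaml.safe_load(f)

os.makedirs(LOCAL_DIR, exist_ok=True)
local_paths = []
for url in cfg["data"]["paths"]:
    # NOTE: basename collisions across directories would need handling;
    # kept simple here for illustration.
    dest = os.path.join(LOCAL_DIR, os.path.basename(url))
    if not os.path.exists(dest):
        print(f"downloading {url} -> {dest}")
        urllib.request.urlretrieve(url, dest)
    local_paths.append(os.path.abspath(dest))

cfg["data"]["paths"] = local_paths
with open("configs/official/OLMo-1B-local.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
```

You would then launch the same torchrun command against the rewritten config (or override data.paths on the command line instead of writing a new file).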