OLMoThreadError: generator thread data thread 0 failed

ybdesire opened this issue 1 year ago

❓ The question

I used the default config configs/official/OLMo-1B.yaml, removed the wandb section, and trained on 8×A800 GPUs with the command torchrun --nproc_per_node=8 scripts/train.py configs/official/OLMo-1B.yaml.

Training starts without errors, but after a few steps (about 10 minutes in) it fails with the message below.
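
Note that the data paths in this config are HTTP(S) URLs, so training streams the data over the network. A quick way to confirm this (a sketch only, assuming PyYAML is installed and that the file list lives under data.paths, as it does in the official configs):

```python
# Quick check (assumptions: PyYAML is installed and the official config
# keeps its training-data locations under data.paths): count how many
# data files would be streamed over HTTP(S) during training.
import yaml
from urllib.parse import urlparse

with open("configs/official/OLMo-1B.yaml") as f:
    cfg = yaml.safe_load(f)

paths = cfg["data"]["paths"]
remote = [p for p in paths if urlparse(p).scheme in ("http", "https")]
print(f"{len(remote)}/{len(paths)} data paths are remote URLs")
```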

[2024-08-18 12:26:53] INFO     [olmo.train:966, rank=0] [step=1/739328,epoch=0]
    optim/total_grad_norm=9.355
    train/CrossEntropyLoss=11.35
    train/Perplexity=84,618
    throughput/total_tokens=4,194,304
    throughput/total_training_Gflops=6,215,264
    throughput/total_training_log_Gflops=15.64
    System/Peak GPU Memory (MB)=42,083
[2024-08-18 12:28:06] INFO     [olmo.train:966, rank=0] [step=2/739328,epoch=0]
    optim/total_grad_norm=59.20
    train/CrossEntropyLoss=10.57
    train/Perplexity=38,880
    throughput/total_tokens=8,388,608
    throughput/total_training_Gflops=12,430,528
    throughput/total_training_log_Gflops=16.34
    throughput/device/tokens_per_second=20,592
    throughput/device/batches_per_second=0.0393
    System/Peak GPU Memory (MB)=43,259
[2024-08-18 12:29:21] INFO     [olmo.train:966, rank=0] [step=3/739328,epoch=0]
    optim/total_grad_norm=28.37
    train/CrossEntropyLoss=10.70
    train/Perplexity=44,302
    throughput/total_tokens=12,582,912
    throughput/total_training_Gflops=18,645,793
    throughput/total_training_log_Gflops=16.74
    throughput/device/tokens_per_second=10,506
    throughput/device/batches_per_second=0.0200
[2024-08-18 12:30:37] INFO     [olmo.train:966, rank=0] [step=4/739328,epoch=0]
    optim/total_grad_norm=9.530
    train/CrossEntropyLoss=11.06
    train/Perplexity=63,518
    throughput/total_tokens=16,777,216
    throughput/total_training_Gflops=24,861,057
    throughput/total_training_log_Gflops=17.03
    throughput/device/tokens_per_second=8,949
    throughput/device/batches_per_second=0.0171
[2024-08-18 12:31:54] INFO     [olmo.train:966, rank=0] [step=5/739328,epoch=0]
    optim/total_grad_norm=17.78
    train/CrossEntropyLoss=10.69
    train/Perplexity=44,126
    throughput/total_tokens=20,971,520
    throughput/total_training_Gflops=31,076,321
    throughput/total_training_log_Gflops=17.25
    throughput/device/tokens_per_second=8,271
    throughput/device/batches_per_second=0.0158
[2024-08-18 12:33:13] INFO     [olmo.train:966, rank=0] [step=6/739328,epoch=0]
    optim/total_grad_norm=7.477
    train/CrossEntropyLoss=10.23
    train/Perplexity=27,666
    throughput/total_tokens=25,165,824
    throughput/total_training_Gflops=37,291,586
    throughput/total_training_log_Gflops=17.43
    throughput/device/tokens_per_second=7,890
    throughput/device/batches_per_second=0.0151
[2024-08-18 12:33:54] CRITICAL [olmo.util:163, rank=5] Uncaught OLMoThreadError: generator thread data thread 0 failed
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/aaabbb/projects/OLMo/olmo/util.py:709 in fill_queue                                        │
│                                                                                                  │
│   706 │                                                                                          │
│   707 │   def fill_queue():                                                                      │
│   708 │   │   try:                                                                               │
│ ❱ 709 │   │   │   for value in g:                                                                │
│   710 │   │   │   │   q.put(value)                                                               │
│   711 │   │   except Exception as e:                                                             │
│   712 │   │   │   q.put(e)                                                                       │
│                                                                                                  │
│ /data/aaabbb/projects/OLMo/olmo/data/iterable_dataset.py:174 in <genexpr>                        │
│                                                                                                  │
│   171 │   │   │                                                                                  │
│   172 │   │   │   thread_generators = []                                                         │
│   173 │   │   │   for i in range(num_threads):                                                   │
│ ❱ 174 │   │   │   │   generator = (self._get_dataset_item(int(idx)) for idx in indices[i::num_th │
│   175 │   │   │   │   thread_generators.append(                                                  │
│   176 │   │   │   │   │   threaded_generator(generator, maxsize=queue_size, thread_name=f"data t │
│   177 │   │   │   │   )                                                                          │
│                                                                                                  │
│ /data/aaabbb/projects/OLMo/olmo/data/iterable_dataset.py:184 in _get_dataset_item                │
│                                                                                                  │
│   181 │   │   │   return (self._get_dataset_item(int(idx)) for idx in indices)                   │
│   182 │                                                                                          │
│   183 │   def _get_dataset_item(self, idx: int) -> Dict[str, Any]:                               │
│ ❱ 184 │   │   item = self.dataset[idx]                                                           │
│   185 │   │   if isinstance(item, dict):                                                         │
│   186 │   │   │   return dict(**item, index=idx)                                                 │
│   187 │   │   else:                                                                              │
│                                                                                                  │
│ /data/aaabbb/projects/OLMo/olmo/data/memmap_dataset.py:196 in __getitem__                        │
│                                                                                                  │
│   193 │   │   │   raise IndexError(f"{index} is out of bounds for dataset of size {len(self)}")  │
│   194 │   │                                                                                      │
│   195 │   │   # Read the data from file.                                                         │
│ ❱ 196 │   │   input_ids = self._read_chunk_from_memmap(self._memmap_paths[memmap_index], memmap_ │
│   197 │   │   out: Dict[str, Any] = {"input_ids": input_ids}                                     │
│   198 │   │   if self.instance_filter_config is not None:                                        │
│   199 │   │   │   out["instance_mask"] = self._validate_instance(input_ids)                      │
│                                                                                                  │
│ /data/aaabbb/projects/OLMo/olmo/data/memmap_dataset.py:162 in _read_chunk_from_memmap            │
│                                                                                                  │
│   159 │   │   item_size = dtype(0).itemsize                                                      │
│   160 │   │   bytes_start = index * item_size * self._chunk_size                                 │
│   161 │   │   num_bytes = item_size * self._chunk_size                                           │
│ ❱ 162 │   │   buffer = get_bytes_range(path, bytes_start, num_bytes)                             │
│   163 │   │   array = np.frombuffer(buffer, dtype=dtype)                                         │
│   164 │   │   if dtype == np.bool_:                                                              │
│   165 │   │   │   return torch.tensor(array)                                                     │
│                                                                                                  │
│ /data/aaabbb/projects/OLMo/olmo/util.py:375 in get_bytes_range                                   │
│                                                                                                  │
│   372 │   │   │   │   parsed.scheme, parsed.netloc, parsed.path.strip("/"), bytes_start, num_byt │
│   373 │   │   │   )                                                                              │
│   374 │   │   elif parsed.scheme in ("http", "https"):                                           │
│ ❱ 375 │   │   │   return _http_get_bytes_range(                                                  │
│   376 │   │   │   │   parsed.scheme, parsed.netloc, parsed.path.strip("/"), bytes_start, num_byt │
│   377 │   │   │   )                                                                              │
│   378 │   │   elif parsed.scheme == "file":                                                      │
│                                                                                                  │
│ /data/aaabbb/projects/OLMo/olmo/util.py:649 in _http_get_bytes_range                             │
│                                                                                                  │
│   646 │   )                                                                                      │
│   647 │   result = response.content                                                              │
│   648 │   assert (                                                                               │
│ ❱ 649 │   │   len(result) == num_bytes                                                           │
│   650 │   ), f"expected {num_bytes} bytes, got {len(result)}"  # Some web servers silently ignor │
│   651 │   return result                                                                          │
│   652                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AssertionError: expected 4096 bytes, got 175

The above exception was the direct cause of the following exception:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/aaabbb/projects/OLMo/scripts/train.py:345 in <module>                                      │
│                                                                                                  │
│   342 │   │   raise OLMoCliError(f"Usage: {sys.argv[0]} [CONFIG_PATH] [OPTIONS]")                │
│   343 │                                                                                          │
│   344 │   cfg = TrainConfig.load(yaml_path, [clean_opt(s) for s in args_list])                   │
│ ❱ 345 │   main(cfg)                                                                              │
│   346                                                                                            │
│                                                                                                  │
│ /data/aaabbb/projects/OLMo/scripts/train.py:317 in main                                          │
│                                                                                                  │
│   314 │   │                                                                                      │
│   315 │   │   if not cfg.dry_run:                                                                │
│   316 │   │   │   log.info("Starting training...")                                               │
│ ❱ 317 │   │   │   trainer.fit()                                                                  │
│   318 │   │   │   log.info("Training complete")                                                  │
│   319 │   │   else:                                                                              │
│   320 │   │   │   log.info("Dry run complete")                                                   │
│                                                                                                  │
│ /data/aaabbb/projects/OLMo/olmo/train.py:1181 in fit                                             │
│                                                                                                  │
│   1178 │   │                                                                                     │
│   1179 │   │   with torch_profiler as p:                                                         │
│   1180 │   │   │   for epoch in range(self.epoch or 0, self.max_epochs):                         │
│ ❱ 1181 │   │   │   │   for batch in self.train_loader:                                           │
│   1182 │   │   │   │   │   # Bookkeeping.                                                        │
│   1183 │   │   │   │   │   # NOTE: To track the global batch size / number of tokens per batch w │
│   1184 │   │   │   │   │   # batches see the same number of tokens, which should be the case for │
│                                                                                                  │
│ /home/tqgpt/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/utils/data/dataload │
│                                                                                                  │
│    628 │   │   │   if self._sampler_iter is None:                                                │
│    629 │   │   │   │   # TODO(https://github.com/pytorch/pytorch/issues/76750)                   │
│    630 │   │   │   │   self._reset()  # type: ignore[call-arg]                                   │
│ ❱  631 │   │   │   data = self._next_data()                                                      │
│    632 │   │   │   self._num_yielded += 1                                                        │
│    633 │   │   │   if self._dataset_kind == _DatasetKind.Iterable and \                          │
│    634 │   │   │   │   │   self._IterableDataset_len_called is not None and \                    │
│                                                                                                  │
│ /home/tqgpt/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/utils/data/dataload │
│                                                                                                  │
│    672 │                                                                                         │
│    673 │   def _next_data(self):                                                                 │
│    674 │   │   index = self._next_index()  # may raise StopIteration                             │
│ ❱  675 │   │   data = self._dataset_fetcher.fetch(index)  # may raise StopIteration              │
│    676 │   │   if self._pin_memory:                                                              │
│    677 │   │   │   data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)            │
│    678 │   │   return data                                                                       │
│                                                                                                  │
│ /home/tqgpt/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/utils/data/_utils/f │
│                                                                                                  │
│   29 │   │   │   data = []                                                                       │
│   30 │   │   │   for _ in possibly_batched_index:                                                │
│   31 │   │   │   │   try:                                                                        │
│ ❱ 32 │   │   │   │   │   data.append(next(self.dataset_iter))                                    │
│   33 │   │   │   │   except StopIteration:                                                       │
│   34 │   │   │   │   │   self.ended = True                                                       │
│   35 │   │   │   │   │   break                                                                   │
│                                                                                                  │
│ /data/aaabbb/projects/OLMo/olmo/data/iterable_dataset.py:179 in <genexpr>                        │
│                                                                                                  │
│   176 │   │   │   │   │   threaded_generator(generator, maxsize=queue_size, thread_name=f"data t │
│   177 │   │   │   │   )                                                                          │
│   178 │   │   │                                                                                  │
│ ❱ 179 │   │   │   return (x for x in roundrobin(*thread_generators))                             │
│   180 │   │   else:                                                                              │
│   181 │   │   │   return (self._get_dataset_item(int(idx)) for idx in indices)                   │
│   182                                                                                            │
│                                                                                                  │
│ /data/aaabbb/projects/OLMo/olmo/util.py:738 in roundrobin                                        │
│                                                                                                  │
│   735 │   while num_active:                                                                      │
│   736 │   │   try:                                                                               │
│   737 │   │   │   for next in nexts:                                                             │
│ ❱ 738 │   │   │   │   yield next()                                                               │
│   739 │   │   except StopIteration:                                                              │
│   740 │   │   │   # Remove the iterator we just exhausted from the cycle.                        │
│   741 │   │   │   num_active -= 1                                                                │
│                                                                                                  │
│ /data/aaabbb/projects/OLMo/olmo/util.py:722 in threaded_generator                                │
│                                                                                                  │
│   719 │                                                                                          │
│   720 │   for x in iter(q.get, sentinel):                                                        │
│   721 │   │   if isinstance(x, Exception):                                                       │
│ ❱ 722 │   │   │   raise OLMoThreadError(f"generator thread {thread_name} failed") from x         │
│   723 │   │   else:                                                                              │
│   724 │   │   │   yield x                                                                        │
│   725                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OLMoThreadError: generator thread data thread 0 failed
W0818 12:33:57.741000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227462 closing signal SIGTERM
W0818 12:33:57.741000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227463 closing signal SIGTERM
W0818 12:33:57.742000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227464 closing signal SIGTERM
W0818 12:33:57.742000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227465 closing signal SIGTERM
W0818 12:33:57.743000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227466 closing signal SIGTERM
W0818 12:33:57.743000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227468 closing signal SIGTERM
W0818 12:33:57.744000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227469 closing signal SIGTERM
/home/tqgpt/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
E0818 12:33:59.086000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 5 (pid: 1227467) of binary: /home/cde/anaconda3/envs/env_olmo_py311/bin/python
Traceback (most recent call last):
  File "/home/cde/anaconda3/envs/env_olmo_py311/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-18_12:33:57
  host      : ubuntu
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 1227467)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
(env_olmo_py311) cde@ubuntu:/data/aaabbb/projects/OLMo$

ybdesire · Aug 18 '24 12:08

Same issue here.

NonvolatileMemory · Oct 07 '24 15:10

The issue is most likely caused by streaming the dataset over the network, i.e., fetching training data from HTTP(S) URLs at train time. Under heavy traffic, rate limiting, or connection problems, the server may return a short error response (here, 175 bytes) instead of the requested data chunk (4096 bytes), which trips the size assertion in _http_get_bytes_range and fails the run after some time.
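
As a stopgap, the ranged read could also validate and retry instead of failing on the first short response. A minimal sketch (this is not OLMo's code; fetch_bytes_range, the retry policy, and the use of requests are all hypothetical):

```python
# Hypothetical stopgap, not OLMo's implementation: retry a ranged HTTP GET
# when the server returns fewer bytes than requested (e.g. a short error
# body like the 175 bytes above) instead of asserting immediately.
import time
import requests

def fetch_bytes_range(url: str, start: int, num_bytes: int, retries: int = 5) -> bytes:
    headers = {"Range": f"bytes={start}-{start + num_bytes - 1}"}
    for attempt in range(retries):
        resp = requests.get(url, headers=headers, timeout=60)
        if resp.status_code in (200, 206) and len(resp.content) == num_bytes:
            return resp.content
        time.sleep(2 ** attempt)  # short/error response: back off and retry
    raise IOError(f"expected {num_bytes} bytes from {url} after {retries} attempts")
```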

To resolve this, download the dataset to local storage and point the config at the local files. This avoids network interruptions and lets training run without such errors.
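
For example, a minimal download helper (a sketch only: it assumes PyYAML and requests are installed, that the config keeps its file list under data.paths as the official configs do, and LOCAL_DIR is a placeholder for your storage):

```python
# Hypothetical helper, not part of OLMo: download every remote data file
# referenced by the config to local disk. Afterwards, point data.paths in
# the YAML at the local copies instead of the URLs.
import os
import yaml
import requests
from urllib.parse import urlparse

CONFIG = "configs/official/OLMo-1B.yaml"
LOCAL_DIR = "/data/olmo-data"  # placeholder: pick a disk with enough space

with open(CONFIG) as f:
    cfg = yaml.safe_load(f)

for url in cfg["data"]["paths"]:
    if urlparse(url).scheme not in ("http", "https"):
        continue  # already local
    dest = os.path.join(LOCAL_DIR, urlparse(url).path.lstrip("/"))
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as out:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                out.write(chunk)
    print(f"downloaded {url} -> {dest}")
```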

aman-17 · Oct 22 '24 18:10