
DataLoader2 with FullSyncIterDataPipe throws an error during initialization

Open chenxingyu-cs opened this issue 1 year ago • 3 comments

🐛 Describe the bug

Hi, we ran into some strange behavior while using DataLoader2. Here are some details about the issue.

  • We are running a long-running training job on 8 AWS P4 nodes, using the HuggingFace Trainer.
  • During HuggingFace training, evaluation is run every training_args.eval_steps training steps.
  • I overrode the HF Trainer to use DataLoader2 for training, evaluation, and test dataset loading. On the dataset side, I'm using an IterDataPipe pipeline with ShardingFilterIterDataPipe (a rough sketch is shown after this list).
  • The issue shown in the log below happens randomly, and most of the time only after the job has been running for a long time (e.g. 20+ hours).

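For context, here is a rough, self-contained sketch of the setup described above. It is not the actual training code: the DataPipeTrainer name, the placeholder datapipe contents, and the batch size are assumptions for illustration; only DataLoader2, DistributedReadingService, and the Trainer.get_*_dataloader overrides reflect what the bullet points describe.

```python
from torchdata.dataloader2 import DataLoader2, DistributedReadingService
from torchdata.datapipes.iter import IterableWrapper
from transformers import Trainer


# Hypothetical placeholder pipeline ending in sharding_filter + collate,
# matching the ShardingFilterIterDataPipe / CollatorIterDataPipe in the log.
def build_eval_datapipe():
    return (
        IterableWrapper(range(64))   # placeholder samples
        .sharding_filter()           # ShardingFilterIterDataPipe
        .batch(8)                    # placeholder batch size
        .collate()                   # CollatorIterDataPipe
    )


class DataPipeTrainer(Trainer):
    # Hypothetical override; the real subclass (mfive/trainer.py) may differ.
    def get_train_dataloader(self):
        return DataLoader2(self.train_dataset, reading_service=DistributedReadingService())

    def get_eval_dataloader(self, eval_dataset=None):
        datapipe = eval_dataset if eval_dataset is not None else self.eval_dataset
        # In this sketch a fresh DataLoader2 is created on every evaluation
        # pass, so the distributed synchronization is set up again each time
        # the trainer evaluates.
        return DataLoader2(datapipe, reading_service=DistributedReadingService())
```

As I understand it, DistributedReadingService wraps the datapipe graph in FullSyncIterDataPipe (with a 1800-second default timeout), which is the datapipe named in the traceback below.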
Can you provide some context on what the root cause could be and how to fix this? Thanks!

Log:



2023-06-08T08:51:15.973-07:00 | File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 1633, in train
2023-06-08T08:51:15.973-07:00 |   return inner_training_loop(
2023-06-08T08:51:15.973-07:00 | File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 1979, in _inner_training_loop
2023-06-08T08:51:15.973-07:00 |   self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
2023-06-08T08:51:15.973-07:00 | File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 2236, in _maybe_log_save_evaluate
2023-06-08T08:51:15.973-07:00 |   metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
2023-06-08T08:51:15.973-07:00 | File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 2932, in evaluate
2023-06-08T08:51:15.973-07:00 |   output = eval_loop(
2023-06-08T08:51:15.973-07:00 | File "/workspace/mfive/mfive/trainer.py", line 236, in evaluation_loop
2023-06-08T08:51:15.973-07:00 |   for step, inputs in enumerate(dataloader):
2023-06-08T08:51:15.973-07:00 | File "/opt/conda/lib/python3.9/site-packages/torchdata/dataloader2/dataloader2.py", line 46, in __next__
2023-06-08T08:51:15.973-07:00 |   next_val = next(self.dataloader._datapipe_iter) # type: ignore[arg-type]
2023-06-08T08:51:15.973-07:00 | File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 173, in wrap_generator
2023-06-08T08:51:15.973-07:00 |   response = gen.send(None)
2023-06-08T08:51:15.973-07:00 | File "/opt/conda/lib/python3.9/site-packages/torchdata/datapipes/iter/util/distributed.py", line 178, in __iter__
2023-06-08T08:51:15.973-07:00 |   self._process_group = dist.new_group(backend="gloo")
2023-06-08T08:51:15.973-07:00 | File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3520, in new_group
2023-06-08T08:51:15.973-07:00 |   pg = _new_process_group_helper(
2023-06-08T08:51:15.973-07:00 | File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1009, in _new_process_group_helper
2023-06-08T08:51:15.973-07:00 |   backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
2023-06-08T08:51:15.973-07:00 | RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:176] bind: Address already in use
2023-06-08T08:51:15.973-07:00 | This exception is thrown by __iter__ of FullSyncIterDataPipe(datapipe=CollatorIterDataPipe, timeout=1800)
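
For reference, here is a minimal single-rank sketch (an assumption for illustration, not the actual job; the MASTER_ADDR/MASTER_PORT values are placeholders) that exercises the same code path: iterating a fullsync-ed pipeline enters FullSyncIterDataPipe.__iter__, which creates a gloo side process group via dist.new_group(backend="gloo"), the call that fails with "bind: Address already in use" in the log above.

```python
import os

import torch.distributed as dist
from torchdata.datapipes.iter import IterableWrapper

# Placeholder rendezvous settings for a single-rank run.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# fullsync() wraps the pipeline in FullSyncIterDataPipe, the same datapipe that
# DistributedReadingService inserts; iterating it reaches the
# dist.new_group(backend="gloo") call shown in the traceback.
datapipe = IterableWrapper(range(8)).sharding_filter().fullsync()
print(list(datapipe))

dist.destroy_process_group()
```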

Versions

Versions of relevant libraries:
[pip3] flake8==6.0.0
[pip3] mypy==0.991
[pip3] mypy-boto3-batch==1.26.103
[pip3] mypy-boto3-ec2==1.26.136
[pip3] mypy-boto3-iam==1.26.97
[pip3] mypy-boto3-s3==1.26.127
[pip3] mypy-boto3-sagemaker==1.26.141
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.24.3
[pip3] torch==2.0.1
[pip3] torch-tb-profiler==0.4.1
[pip3] torchdata==0.6.1
[pip3] torchmetrics==0.11.4
[pip3] torchsnapshot-nightly==2023.3.15
[pip3] torchvision==0.15.2
[pip3] torchx-nightly==2023.5.25
[pip3] triton==2.0.0
[conda] numpy                     1.24.3                   pypi_0    pypi
[conda] torch                     2.0.1                    pypi_0    pypi
[conda] torch-tb-profiler         0.4.1                    pypi_0    pypi
[conda] torchdata                 0.6.1                    pypi_0    pypi
[conda] torchmetrics              0.11.4                   pypi_0    pypi
[conda] torchsnapshot-nightly     2023.3.15                pypi_0    pypi
[conda] torchvision               0.15.2                   pypi_0    pypi
[conda] torchx-nightly            2023.5.25                pypi_0    pypi
[conda] triton                    2.0.0                    pypi_0    pypi

chenxingyu-cs · Jun 19 '23 18:06

@ejguan Hi, can you share any insights you have? Thanks a lot!

chenxingyu-cs · Jun 20 '23 19:06

Are you running multiple DDP jobs at the same time?

ejguan · Jun 20 '23 21:06

@ejguan I'm only running one DDP job, initialized by torchx. I got these errors while running the job on AWS Batch and on SageMaker, where I believe the instances are isolated and there should be no other jobs running.

chenxingyu-cs · Jun 22 '23 17:06