
DataLoader2 with FullSyncIterDataPipe throws an error during initialization

Open chenxingyu-cs opened this issue 1 year ago • 3 comments

🐛 Describe the bug

Hi, we ran into some strange behavior while using DataLoader2. Here are some details about the issue.

  • We are running a long-running training job on 8 AWS P4 nodes, using the HuggingFace Trainer.
  • During HuggingFace training, evaluation is run every training_args.eval_steps training steps.
  • I overrode the HF Trainer to use DataLoader2 for training, evaluation, and test dataset loading. On the dataset side, I'm using an IterDataPipe pipeline with ShardingFilterIterDataPipe (a rough sketch is shown after this list).
  • The issue shown in the log below happens randomly, and most of the time only after the job has been running for a long time (e.g. 20+ hours).

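For context, here is a rough, self-contained sketch of the setup described above. It is not the actual training code: the DataPipeTrainer name, the placeholder datapipe contents, and the batch size are assumptions for illustration; only DataLoader2, DistributedReadingService, and the Trainer.get_*_dataloader overrides reflect what the bullet points describe.

```python
from torchdata.dataloader2 import DataLoader2, DistributedReadingService
from torchdata.datapipes.iter import IterableWrapper
from transformers import Trainer


# Hypothetical placeholder pipeline ending in sharding_filter + collate,
# matching the ShardingFilterIterDataPipe / CollatorIterDataPipe in the log.
def build_eval_datapipe():
    return (
        IterableWrapper(range(64))   # placeholder samples
        .sharding_filter()           # ShardingFilterIterDataPipe
        .batch(8)                    # placeholder batch size
        .collate()                   # CollatorIterDataPipe
    )


class DataPipeTrainer(Trainer):
    # Hypothetical override; the real subclass (mfive/trainer.py) may differ.
    def get_train_dataloader(self):
        return DataLoader2(self.train_dataset, reading_service=DistributedReadingService())

    def get_eval_dataloader(self, eval_dataset=None):
        datapipe = eval_dataset if eval_dataset is not None else self.eval_dataset
        # In this sketch a fresh DataLoader2 is created on every evaluation
        # pass, so the distributed synchronization is set up again each time
        # the trainer evaluates.
        return DataLoader2(datapipe, reading_service=DistributedReadingService())
```

As I understand it, DistributedReadingService wraps the datapipe graph in FullSyncIterDataPipe (with a 1800-second default timeout), which is the datapipe named in the traceback below.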
Can you provide some context on what the root cause could be and how to fix this? Thanks!

Log:



2023-06-08T08:51:15.973-07:00 | File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 1633, in train
2023-06-08T08:51:15.973-07:00 |   return inner_training_loop(
2023-06-08T08:51:15.973-07:00 | File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 1979, in _inner_training_loop
2023-06-08T08:51:15.973-07:00 |   self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
2023-06-08T08:51:15.973-07:00 | File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 2236, in _maybe_log_save_evaluate
2023-06-08T08:51:15.973-07:00 |   metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
2023-06-08T08:51:15.973-07:00 | File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 2932, in evaluate
2023-06-08T08:51:15.973-07:00 |   output = eval_loop(
2023-06-08T08:51:15.973-07:00 | File "/workspace/mfive/mfive/trainer.py", line 236, in evaluation_loop
2023-06-08T08:51:15.973-07:00 |   for step, inputs in enumerate(dataloader):
2023-06-08T08:51:15.973-07:00 | File "/opt/conda/lib/python3.9/site-packages/torchdata/dataloader2/dataloader2.py", line 46, in __next__
2023-06-08T08:51:15.973-07:00 |   next_val = next(self.dataloader._datapipe_iter) # type: ignore[arg-type]
2023-06-08T08:51:15.973-07:00 | File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 173, in wrap_generator
2023-06-08T08:51:15.973-07:00 |   response = gen.send(None)
2023-06-08T08:51:15.973-07:00 | File "/opt/conda/lib/python3.9/site-packages/torchdata/datapipes/iter/util/distributed.py", line 178, in __iter__
2023-06-08T08:51:15.973-07:00 |   self._process_group = dist.new_group(backend="gloo")
2023-06-08T08:51:15.973-07:00 | File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3520, in new_group
2023-06-08T08:51:15.973-07:00 |   pg = _new_process_group_helper(
2023-06-08T08:51:15.973-07:00 | File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1009, in _new_process_group_helper
2023-06-08T08:51:15.973-07:00 |   backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
2023-06-08T08:51:15.973-07:00 | RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:176] bind: Address already in use
2023-06-08T08:51:15.973-07:00 | This exception is thrown by __iter__ of FullSyncIterDataPipe(datapipe=CollatorIterDataPipe, timeout=1800)
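
For reference, here is a minimal single-rank sketch (an assumption for illustration, not the actual job; the MASTER_ADDR/MASTER_PORT values are placeholders) that exercises the same code path: iterating a fullsync-ed pipeline enters FullSyncIterDataPipe.__iter__, which creates a gloo side process group via dist.new_group(backend="gloo"), the call that fails with "bind: Address already in use" in the log above.

```python
import os

import torch.distributed as dist
from torchdata.datapipes.iter import IterableWrapper

# Placeholder rendezvous settings for a single-rank run.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# fullsync() wraps the pipeline in FullSyncIterDataPipe, the same datapipe that
# DistributedReadingService inserts; iterating it reaches the
# dist.new_group(backend="gloo") call shown in the traceback.
datapipe = IterableWrapper(range(8)).sharding_filter().fullsync()
print(list(datapipe))

dist.destroy_process_group()
```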

Versions

Versions of relevant libraries:
[pip3] flake8==6.0.0
[pip3] mypy==0.991
[pip3] mypy-boto3-batch==1.26.103
[pip3] mypy-boto3-ec2==1.26.136
[pip3] mypy-boto3-iam==1.26.97
[pip3] mypy-boto3-s3==1.26.127
[pip3] mypy-boto3-sagemaker==1.26.141
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.24.3
[pip3] torch==2.0.1
[pip3] torch-tb-profiler==0.4.1
[pip3] torchdata==0.6.1
[pip3] torchmetrics==0.11.4
[pip3] torchsnapshot-nightly==2023.3.15
[pip3] torchvision==0.15.2
[pip3] torchx-nightly==2023.5.25
[pip3] triton==2.0.0
[conda] numpy                     1.24.3                   pypi_0    pypi
[conda] torch                     2.0.1                    pypi_0    pypi
[conda] torch-tb-profiler         0.4.1                    pypi_0    pypi
[conda] torchdata                 0.6.1                    pypi_0    pypi
[conda] torchmetrics              0.11.4                   pypi_0    pypi
[conda] torchsnapshot-nightly     2023.3.15                pypi_0    pypi
[conda] torchvision               0.15.2                   pypi_0    pypi
[conda] torchx-nightly            2023.5.25                pypi_0    pypi
[conda] triton                    2.0.0                    pypi_0    pypi

chenxingyu-cs · Jun 19 '23 18:06

@ejguan Hi, can you share any insights you have? Thanks a lot!

chenxingyu-cs · Jun 20 '23 19:06

Are you running multiple DDP jobs at the same time?

ejguan · Jun 20 '23 21:06

@ejguan I'm only running one DDP job, initialized by torchx. I got these errors while running the job on AWS Batch and on SageMaker, where I believe the instances are isolated and there should be no other jobs running.

chenxingyu-cs · Jun 22 '23 17:06