DataLoader2 with FullSyncIterDataPipe throws error during initialization
🐛 Describe the bug
Hi, we found some strange behavior while using DataLoader2. Here are some details about the issue.
- We are running a long-running training job on 8 AWS P4 nodes, using the HuggingFace Trainer.
- During HuggingFace training, evaluation is called every `training_args.eval_steps` training steps.
- I overrode the HF Trainer to use DataLoader2 for training, evaluation, and test dataset loading. On the dataset side, I'm using an `IterDataPipe` with `ShardingFilterIterDataPipe`.
- The issue shown in the log happens randomly, and most of the time only after the job has been running for a long time (e.g. 20+ hours).
Can you help provide some context on what the root cause could be and how to fix it? Thanks!
Log:
```
  File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 1979, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 2236, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 2932, in evaluate
    output = eval_loop(
  File "/workspace/mfive/mfive/trainer.py", line 236, in evaluation_loop
    for step, inputs in enumerate(dataloader):
  File "/opt/conda/lib/python3.9/site-packages/torchdata/dataloader2/dataloader2.py", line 46, in __next__
    next_val = next(self.dataloader._datapipe_iter)  # type: ignore[arg-type]
  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 173, in wrap_generator
    response = gen.send(None)
  File "/opt/conda/lib/python3.9/site-packages/torchdata/datapipes/iter/util/distributed.py", line 178, in __iter__
    self._process_group = dist.new_group(backend="gloo")
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3520, in new_group
    pg = _new_process_group_helper(
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1009, in _new_process_group_helper
    backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:176] bind: Address already in use
This exception is thrown by __iter__ of FullSyncIterDataPipe(datapipe=CollatorIterDataPipe, timeout=1800)
```
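For context on the error itself: the failing call is an OS-level `bind(2)` inside gloo's TCP transport. Each `dist.new_group(backend="gloo")` binds fresh TCP ports on every node, and "Address already in use" means a requested port was already taken. A minimal stdlib sketch (no torch involved, purely illustrative) reproduces the same OS error by binding a second socket to an address a first socket still holds:

```python
import socket

def bind_twice():
    """Bind two sockets to the same (ip, port) and report whether the
    second bind fails with the OS error gloo surfaces as
    'bind: Address already in use'."""
    s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s1.bind(("127.0.0.1", 0))      # let the OS pick a free port
    addr = s1.getsockname()
    s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s2.bind(addr)              # same port, no SO_REUSEADDR -> EADDRINUSE
        return False
    except OSError:
        return True
    finally:
        s1.close()
        s2.close()
```

Since the `distributed.py` frame above shows FullSyncIterDataPipe calling `dist.new_group(backend="gloo")` inside `__iter__`, re-creating the evaluation iterator every `eval_steps` could plausibly accumulate process groups until a port collision occurs; that is an assumption consistent with the traceback, not a confirmed diagnosis.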
Versions
Versions of relevant libraries:
[pip3] flake8==6.0.0
[pip3] mypy==0.991
[pip3] mypy-boto3-batch==1.26.103
[pip3] mypy-boto3-ec2==1.26.136
[pip3] mypy-boto3-iam==1.26.97
[pip3] mypy-boto3-s3==1.26.127
[pip3] mypy-boto3-sagemaker==1.26.141
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.24.3
[pip3] torch==2.0.1
[pip3] torch-tb-profiler==0.4.1
[pip3] torchdata==0.6.1
[pip3] torchmetrics==0.11.4
[pip3] torchsnapshot-nightly==2023.3.15
[pip3] torchvision==0.15.2
[pip3] torchx-nightly==2023.5.25
[pip3] triton==2.0.0
[conda] numpy 1.24.3 pypi_0 pypi
[conda] torch 2.0.1 pypi_0 pypi
[conda] torch-tb-profiler 0.4.1 pypi_0 pypi
[conda] torchdata 0.6.1 pypi_0 pypi
[conda] torchmetrics 0.11.4 pypi_0 pypi
[conda] torchsnapshot-nightly 2023.3.15 pypi_0 pypi
[conda] torchvision 0.15.2 pypi_0 pypi
[conda] torchx-nightly 2023.5.25 pypi_0 pypi
[conda] triton 2.0.0 pypi_0 pypi
@ejguan Hi, can you provide any insights you have? Many thanks!
Are you running multiple DDP jobs at the same time?
@ejguan I'm only running one DDP job. The DDP job is initialized by torchx. I got these errors while running the job on AWS Batch and SageMaker, where I believe all the instances are isolated and no other job should be running.
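One possible mitigation, offered as a sketch rather than a confirmed fix: create the gloo process group once and reuse it across iterations, instead of letting each `__iter__` call allocate a new one. The caching helper below is hypothetical glue code (the torchdata 0.6.1 pipe does not expose a parameter for this); the `factory` argument stands in for something like `lambda: dist.new_group(backend="gloo")` in a real job:

```python
class ProcessGroupCache:
    """Memoize an expensive group-creating callable so that repeated
    calls reuse one group instead of binding new TCP ports each time."""

    def __init__(self, factory):
        self._factory = factory  # e.g. lambda: dist.new_group(backend="gloo")
        self._group = None

    def get(self):
        # Create the group on first use only; later calls return it as-is.
        if self._group is None:
            self._group = self._factory()
        return self._group
```

Whether this can be wired into FullSyncIterDataPipe without patching torchdata internals is an open question for the maintainers; the sketch only illustrates the "create once, reuse everywhere" pattern that avoids repeated port binds.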