
Socket Timeout when using DDP

Open sajastu opened this issue 2 years ago • 8 comments

System Info

- `transformers` version: 4.17.0.dev0
- Platform: Linux-4.15.0-176-generic-x86_64-with-glibc2.17
- Python version: 3.8.13
- PyTorch version (GPU?): 1.8.2 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes (run_summarization.py script)

Who can help?

@patrickvonplaten @patil-suraj

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

I'm constructing a dataset (.parquet format) that is similar to the json format but has additional fields to construct a graph for each example in the dataset. When I train the model in DDP (distributed) mode, I get RuntimeError: Socket Timeout. Here is the full stack trace:

Running tokenizer on train dataset #0:  24%|████████████████████▌                              | 7/29 [28:27<1:46:58, 291.73s/ba]
(progress bars from the other tokenizer workers, all around 21-24%, are printed over the traceback lines below)
Traceback (most recent call last):
  File "examples/pytorch/summarization/run_summarization.py", line 987, in <module>
    main()
  File "examples/pytorch/summarization/run_summarization.py", line 791, in main
    with training_args.main_process_first(desc="train dataset map pre-processing"):
  File "/home/sajad/anaconda3/envs/myenv-py38/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/sajad/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/transformers/training_args.py", line 1264, in main_process_first
    torch.distributed.barrier()
  File "/home/sajad/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2420, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: Socket Timeout
Killing subprocess 62044
Killing subprocess 62045
Traceback (most recent call last):
  File "/home/sajad/anaconda3/envs/myenv-py38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/sajad/anaconda3/envs/myenv-py38/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/sajad/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/sajad/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main

Expected behavior

Running the preprocessing function on each training split.

sajastu avatar May 05 '22 23:05 sajastu

Not sure if it's related to the dataset (.parquet format). Could you please post the code snippet you used to launch the script? Thanks.

patil-suraj avatar May 11 '22 11:05 patil-suraj

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jun 05 '22 15:06 github-actions[bot]

I met the same error. I tried to pre-train on a 25GB Korean corpus using example/run_clm.py. I haven't tested it in an environment without DDP yet, but I think this problem is related to the corpus, because there was no problem when the corpus was small. The process was killed at around 30000~32000 of 85249. The tokenizer type is Byte-level BPE.

  • My script
python -m torch.distributed.launch \
    --nproc_per_node 4 $TRANSFORMERS_PATH/pytorch/language-modeling/run_clm.py \
    --model_type gpt2 \
    --tokenizer_name $TOKENIZER_PATH/$MODEL_NAME \
    --config_overrides bos_token_id=0,eos_token_id=0 \
    --block_size 1024 \
    --train_file $DATASET_PATH/train.txt \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --do_train \
    --do_eval \
    --output_dir $MODEL_PATH/$MODEL_NAME \
    --num_train_epochs 5 \
    --weight_decay 0.01 \
    --learning_rate 1e-5 \
    --warmup_steps 8000 \
    --save_strategy steps \
    --save_steps 4000 \
    --save_total_limit 10 \
    --evaluation_strategy steps \
    --eval_steps 4000 \
    --load_best_model_at_end \
    --validation_split_percentage 5
  • logs
Running tokenizer on dataset:  38%|███▊      | 32066/85249 [32:12<53:25, 16.59ba/s]
Running tokenizer on dataset:  38%|███▊      | 32068/85249 [32:13<53:59, 16.42ba/s]
(the same traceback is raised by each of the non-zero ranks and interleaved in the console output)
Traceback (most recent call last):
  File "/home/dofirst/workspace/scripts/../../transformers/examples/pytorch/language-modeling/run_clm.py", line 563, in <module>
    main()
  File "/home/dofirst/workspace/scripts/../../transformers/examples/pytorch/language-modeling/run_clm.py", line 397, in main
    with training_args.main_process_first(desc="dataset map tokenization"):
  File "/home/dofirst/miniconda3/envs/huggingface/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/dofirst/workspace/transformers/src/transformers/training_args.py", line 1368, in main_process_first
    torch.distributed.barrier()
  File "/home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2776, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: [3] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from recvBytes at /opt/conda/conda-bld/pytorch_1646755903507/work/torch/csrc/distributed/c10d/Utils.hpp:580 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f09757d01bd in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x6c (0x7f09757cc90c in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x11f (0x7f09ab3b3d4f in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::string const&) + 0x21 (0x7f09ab3b4cd1 in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::string const&) + 0x5b (0x7f09ab3b4d5b in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7f09ab3868a2 in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7f09ab3868a2 in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7f09ab3868a2 in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, c10d::OpType, std::string const&, int) + 0xe4 (0x7f09b3661df4 in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #9: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x1d9 (0x7f09b3665e89 in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #10: <unknown function> + 0xb4c325 (0x7f09b3669325 in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #11: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0xf (0x7f09b366a61f in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #12: c10d::ProcessGroupNCCL::allreduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x2d3 (0x7f09b3670733 in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #13: c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) + 0x72a (0x7f09b367a18a in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #14: <unknown function> + 0x800291 (0x7f09f8d2b291 in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #15: <unknown function> + 0x1e5d67 (0x7f09f8710d67 in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #16: <unknown function> + 0x13c00e (0x559abc56400e in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #17: _PyObject_MakeTpCall + 0x3bf (0x559abc55913f in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #18: <unknown function> + 0x166ca0 (0x559abc58eca0 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x1510 (0x559abc5ffeb0 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #20: <unknown function> + 0x1c7d37 (0x559abc5efd37 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x4f83 (0x559abc603923 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #22: <unknown function> + 0x197bc5 (0x559abc5bfbc5 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #23: <unknown function> + 0x13b23d (0x559abc56323d in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x71b (0x559abc5ff0bb in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #25: _PyFunction_Vectorcall + 0x1b7 (0x559abc5f57e7 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #26: <unknown function> + 0x9ce79 (0x559abc4c4e79 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #27: <unknown function> + 0x13bb70 (0x559abc563b70 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #28: _PyEval_EvalFrameDefault + 0x21a2 (0x559abc600b42 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #29: _PyEval_EvalCodeWithName + 0xd5f (0x559abc5f50ff in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #30: _PyFunction_Vectorcall + 0x594 (0x559abc5f5bc4 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #31: _PyEval_EvalFrameDefault + 0x71b (0x559abc5ff0bb in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #32: _PyEval_EvalCodeWithName + 0x260 (0x559abc5f4600 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #33: PyEval_EvalCode + 0x23 (0x559abc5f5eb3 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #34: <unknown function> + 0x242622 (0x559abc66a622 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #35: <unknown function> + 0x2531d2 (0x559abc67b1d2 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #36: <unknown function> + 0x25636b (0x559abc67e36b in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #37: PyRun_SimpleFileExFlags + 0x1bf (0x559abc67e54f in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #38: Py_RunMain + 0x3a9 (0x559abc67ea29 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #39: Py_BytesMain + 0x39 (0x559abc67ec29 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #40: __libc_start_main + 0xe7 (0x7f0a3fb46c87 in /lib/x86_64-linux-gnu/libc.so.6)
frame #41: <unknown function> + 0x1f9ad7 (0x559abc621ad7 in /home/dofirst/miniconda3/envs/huggingface/bin/python)

    torch.distributed.barrier()
  File "/home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2776, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: [2] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from recvBytes at /opt/conda/conda-bld/pytorch_1646755903507/work/torch/csrc/distributed/c10d/Utils.hpp:580 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f768dda91bd in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libc10.so)

tospirits avatar Jun 07 '22 02:06 tospirits

I met the same error. I tried to pre-train on a 25GB Korean corpus using example/run_clm.py. I haven't tested it in an environment without DDP yet, but I think this problem is related to the corpus, because there was no problem when the corpus was small. The process was killed at around 30000~32000 of 85249. The tokenizer type is Byte-level BPE.

I succeeded in pre-training without DDP. The tokenizer run finished fine, and afterwards I could use the cached data with DDP. I don't know the cause yet, but this problem seems to be related to DDP.

My English is not that great. Nevertheless I want to solve this problem.

tospirits avatar Jun 07 '22 10:06 tospirits

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jul 01 '22 15:07 github-actions[bot]

Looks like the process gets killed due to the torch.distributed.launch/run timeout of 30 minutes? (https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group)

I had the same problem, where my job would be stopped when using DDP due to the long mapping/tokenization.

gugarosa avatar Jul 07 '22 12:07 gugarosa

I have a similar task, and my torch.distributed launch gets interrupted due to the 30-minute timeout.

In my case, when I run the script normally, e.g. python run.py, the dataset gets cached, but when I run it with torch.distributed launch the cache isn't used, so the entire preprocessing step runs again and times out.

StephennFernandes avatar Jul 14 '22 15:07 StephennFernandes

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Aug 08 '22 15:08 github-actions[bot]

Re-opening as it doesn't seem like it's been solved. Maybe @sgugger could help here?

patrickvonplaten avatar Aug 31 '22 12:08 patrickvonplaten

@patrickvonplaten To give you some clue about the potential root of the problem: given my experiment and #, I believe this happens when the script deals with extremely large-scale datasets. Mine was over 100GB, most of which was the graph fields I had put in each example (in the parquet file). I managed to get past this by running on a single GPU first, and then using the cached file for fast loading in the multi-GPU setting.
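For anyone who wants to follow the same route, a minimal sketch of that two-step workflow (the script path and arguments here are placeholders rather than a complete command; the data/preprocessing arguments must be identical in both runs so the datasets cache is reused):

# Step 1: run once without the distributed launcher so the expensive
# tokenization/map step finishes and is written to the datasets cache
# (stopping after a single step is enough).
CUDA_VISIBLE_DEVICES=0 python examples/pytorch/summarization/run_summarization.py \
    --model_name_or_path t5-small \
    --train_file data/train.parquet \
    --output_dir /tmp/cache_warmup \
    --do_train \
    --max_steps 1

# Step 2: launch the real multi-GPU run with the same data arguments; the
# mapped dataset is now loaded from the cache, so the barrier is released
# well before the 30-minute timeout.
python -m torch.distributed.launch --nproc_per_node 4 \
    examples/pytorch/summarization/run_summarization.py \
    --model_name_or_path t5-small \
    --train_file data/train.parquet \
    --output_dir /tmp/real_run \
    --do_train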

sajastu avatar Aug 31 '22 12:08 sajastu

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Sep 24 '22 15:09 github-actions[bot]

Hey guys, I'm having the same issue here when running in distributed mode (both with torch.distributed.launch and with elastic run); it seems to me like this isn't solved yet.

My system info:

  • transformers version: 4.24.0
  • Platform: Linux-4.15.0-166-generic
  • Python version: 3.8.10
  • PyTorch version (GPU?): 1.10.2+cu113 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script (Number)?: Yes (7-8 GPUs).
  • Using distributed or parallel set-up in script?: Yes (run_clm.py script)
  • Number of nodes in distributed: 1

My run information

  • Modified scripts: My own modified script of run_clm.py, released in version 4.24.0.
  • Dataset: openwebtext (from the hub)

Notes

  • When using smaller datasets (e.g. wikitext-2, wikitext-103), I'm not having the issue.
  • As mentioned above, the error appears after ~30 minutes; in my case, after 31:05.

Reproduction

I'm running the following:

torchrun \
    --standalone \
    --nnodes=1 \
    --nproc_per_node=${NUM_GPU} \
    ./code/gpt2/Model-Compression-Research-Package/examples/transformers/language-modeling/run_clm.py \
    --model_name_or_path ${MODEL} \
    --dataset_name ${DS_NAME} \
    --save_steps ${SAVE_STEPS} \
    --logging_steps 1000 \
    --eval_steps 2000 \
    --do_train \
    --do_eval \
    --seed ${RANDOM} \
    --max_steps ${MAX_TRAIN_STEPS} \
    --learning_rate ${LR} \
    --per_device_train_batch_size ${TRAIN_BATCH} \
    --gradient_accumulation_steps ${ACC_STEPS} \
    --per_device_eval_batch_size ${EVAL_BATCH} \
    --evaluation_strategy steps \
    --logging_dir ${OUTPUT_DIR} \
    --output_dir ${OUTPUT_DIR} \
    --overwrite_output_dir \
    --load_best_model_at_end \
    --max_train_samples 200 \
    --max_eval_samples 200

And I'm having the following error output:

Traceback (most recent call last):
  File "./code/gpt2/Model-Compression-Research-Package/examples/transformers/language-modeling/run_clm.py", line 414, in main
    with training_args.main_process_first(desc="dataset map tokenization"):
  File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/venv/lib/python3.8/site-packages/transformers/training_args.py", line 1668, in main_process_first
    torch.distributed.barrier()
  File "/venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2709, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: Socket Timeout
(the same traceback is raised by each of the other ranks and interleaved in the console output)

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 5732 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 5733) of binary: /venv/bin/python3
Traceback (most recent call last):
  File "/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/venv/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./code/gpt2/Model-Compression-Research-Package/examples/transformers/language-modeling/run_clm.py FAILED

Failures (all at 2022-11-11_07:49:25, host ido-branch-s2n2k-pxvf4, exitcode 1, error_file <N/A>, traceback: To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html):
  [1]: rank 2 (local_rank: 2), pid 5734
  [2]: rank 3 (local_rank: 3), pid 5735
  [3]: rank 4 (local_rank: 4), pid 5736
  [4]: rank 5 (local_rank: 5), pid 5737
  [5]: rank 6 (local_rank: 6), pid 5738
Root Cause (first observed failure):
  [0]: rank 1 (local_rank: 1), pid 5733

Thanks a lot in advance for looking into it

IdoAmit198 avatar Nov 11 '22 08:11 IdoAmit198

You are not using the ddp_timeout training argument to set a value higher than the default 30 minutes, so if you have a big dataset to preprocess, you get this error. Use a bigger value to solve this, or preprocess your dataset in a non-distributed fashion.
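For example, with the example scripts the argument can be passed straight on the command line; a minimal sketch (the timeout value is arbitrary and the remaining arguments are abbreviated from the torchrun command above):

# --ddp_timeout is in seconds; the default is 1800 (30 minutes).
torchrun --standalone --nnodes=1 --nproc_per_node=${NUM_GPU} \
    run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name openwebtext \
    --do_train --do_eval \
    --output_dir ${OUTPUT_DIR} \
    --ddp_timeout 10800

In your own training script, the same thing can be done by setting ddp_timeout on TrainingArguments.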

sgugger avatar Nov 14 '22 05:11 sgugger

@sgugger what if I am launching my script with the torch.distributed.launch utility? Then, even if I update ddp_timeout, it does not get reflected, and the processes halt after 30 minutes (the default).

10-zin avatar Mar 15 '23 11:03 10-zin

I met the same problem.

mingxiaoh avatar Mar 29 '23 01:03 mingxiaoh

If you use torch.distributed.launch with a ddp_timeout that is not listened to, it sounds like a bug in PyTorch ;-)

sgugger avatar Mar 29 '23 13:03 sgugger

I met the same error. I tried to pre-train on a 25GB Korean corpus using example/run_clm.py. I haven't tested it in an environment without DDP yet, but I think this problem is related to the corpus, because there was no problem when the corpus was small. The process was killed at around 30000~32000 of 85249. The tokenizer type is Byte-level BPE.

I succeeded in pre-training without DDP. The tokenizer run finished fine, and afterwards I could use the cached data with DDP. I don't know the cause yet, but this problem seems to be related to DDP.

My English is not that great. Nevertheless I want to solve this problem.

Hi, I faced the same error as yours. It seems your solution is the only way to solve this problem, but how do you tokenize the data without DDP and then use the cached data for DDP training? When I use distributed training, it starts multiple processes from the beginning.

acadaiaca avatar Aug 21 '23 11:08 acadaiaca