transformers
Socket Timeout when using DDP
System Info
- `transformers` version: 4.17.0.dev0
- Platform: Linux-4.15.0-176-generic-x86_64-with-glibc2.17
- Python version: 3.8.13
- PyTorch version (GPU?): 1.8.2 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes (run_summarization.py script)
Who can help?
@patrickvonplaten @patil-suraj
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
I'm constructing a dataset (in .parquet format) that is similar to the json format but has additional fields used to build a graph for each example in the dataset. When I train the model in DDP (distributed) mode, I get RuntimeError: Socket Timeout.
Here is the full stack trace:
Running tokenizer on train dataset #0:  24%|██▍       | 7/29 [28:27<1:46:58, 291.73s/ba]
Traceback (most recent call last):
  File "examples/pytorch/summarization/run_summarization.py", line 987, in <module>
    main()
  File "examples/pytorch/summarization/run_summarization.py", line 791, in main
    with training_args.main_process_first(desc="train dataset map pre-processing"):
  File "/home/sajad/anaconda3/envs/myenv-py38/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/sajad/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/transformers/training_args.py", line 1264, in main_process_first
    torch.distributed.barrier()
  File "/home/sajad/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2420, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: Socket Timeout
Killing subprocess 62044
Killing subprocess 62045
Traceback (most recent call last):
  File "/home/sajad/anaconda3/envs/myenv-py38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/sajad/anaconda3/envs/myenv-py38/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/sajad/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/sajad/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
Expected behavior
Running the preprocessing function on each training split. Not sure if it's related to the dataset (.parquet format).
Could you please post the code snippet you used to launch the script? Thanks.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I met the same error. I tried to pre-train with a 25GB Korean corpus using example/run_clm.py. I haven't tested it in an environment without DDP yet, but I think this problem is related to the corpus size, because there was no problem with a smaller corpus. The process was killed at around 30000–32000 of 85249. The tokenizer type is byte-level BPE.
- My script
python -m torch.distributed.launch \
--nproc_per_node 4 $TRANSFORMERS_PATH/pytorch/language-modeling/run_clm.py \
--model_type gpt2 \
--tokenizer_name $TOKENIZER_PATH/$MODEL_NAME \
--config_overrides bos_token_id=0,eos_token_id=0 \
--block_size 1024 \
--train_file $DATASET_PATH/train.txt \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--do_train \
--do_eval \
--output_dir $MODEL_PATH/$MODEL_NAME \
--num_train_epochs 5 \
--weight_decay 0.01 \
--learning_rate 1e-5 \
--warmup_steps 8000 \
--save_strategy steps \
--save_steps 4000 \
--save_total_limit 10 \
--evaluation_strategy steps \
--eval_steps 4000 \
--load_best_model_at_end \
--validation_split_percentage 5
- logs
Running tokenizer on dataset:  38%|███▊      | 32066/85249 [32:12<53:25, 16.59ba/s]
Running tokenizer on dataset:  38%|███▊      | 32068/85249 [32:13<53:59, 16.42ba/s]
Traceback (most recent call last):
  File "/home/dofirst/workspace/scripts/../../transformers/examples/pytorch/language-modeling/run_clm.py", line 563, in <module>
    main()
  File "/home/dofirst/workspace/scripts/../../transformers/examples/pytorch/language-modeling/run_clm.py", line 397, in main
    with training_args.main_process_first(desc="dataset map tokenization"):
  File "/home/dofirst/miniconda3/envs/huggingface/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/dofirst/workspace/transformers/src/transformers/training_args.py", line 1368, in main_process_first
    torch.distributed.barrier()
  File "/home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2776, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: [3] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from recvBytes at /opt/conda/conda-bld/pytorch_1646755903507/work/torch/csrc/distributed/c10d/Utils.hpp:580 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f09757d01bd in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x6c (0x7f09757cc90c in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x11f (0x7f09ab3b3d4f in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::string const&) + 0x21 (0x7f09ab3b4cd1 in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::string const&) + 0x5b (0x7f09ab3b4d5b in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7f09ab3868a2 in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7f09ab3868a2 in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7f09ab3868a2 in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, c10d::OpType, std::string const&, int) + 0xe4 (0x7f09b3661df4 in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #9: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x1d9 (0x7f09b3665e89 in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #10: <unknown function> + 0xb4c325 (0x7f09b3669325 in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #11: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0xf (0x7f09b366a61f in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #12: c10d::ProcessGroupNCCL::allreduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x2d3 (0x7f09b3670733 in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #13: c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) + 0x72a (0x7f09b367a18a in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #14: <unknown function> + 0x800291 (0x7f09f8d2b291 in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #15: <unknown function> + 0x1e5d67 (0x7f09f8710d67 in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #16: <unknown function> + 0x13c00e (0x559abc56400e in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #17: _PyObject_MakeTpCall + 0x3bf (0x559abc55913f in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #18: <unknown function> + 0x166ca0 (0x559abc58eca0 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x1510 (0x559abc5ffeb0 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #20: <unknown function> + 0x1c7d37 (0x559abc5efd37 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x4f83 (0x559abc603923 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #22: <unknown function> + 0x197bc5 (0x559abc5bfbc5 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #23: <unknown function> + 0x13b23d (0x559abc56323d in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x71b (0x559abc5ff0bb in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #25: _PyFunction_Vectorcall + 0x1b7 (0x559abc5f57e7 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #26: <unknown function> + 0x9ce79 (0x559abc4c4e79 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #27: <unknown function> + 0x13bb70 (0x559abc563b70 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #28: _PyEval_EvalFrameDefault + 0x21a2 (0x559abc600b42 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #29: _PyEval_EvalCodeWithName + 0xd5f (0x559abc5f50ff in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #30: _PyFunction_Vectorcall + 0x594 (0x559abc5f5bc4 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #31: _PyEval_EvalFrameDefault + 0x71b (0x559abc5ff0bb in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #32: _PyEval_EvalCodeWithName + 0x260 (0x559abc5f4600 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #33: PyEval_EvalCode + 0x23 (0x559abc5f5eb3 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #34: <unknown function> + 0x242622 (0x559abc66a622 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #35: <unknown function> + 0x2531d2 (0x559abc67b1d2 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #36: <unknown function> + 0x25636b (0x559abc67e36b in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #37: PyRun_SimpleFileExFlags + 0x1bf (0x559abc67e54f in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #38: Py_RunMain + 0x3a9 (0x559abc67ea29 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #39: Py_BytesMain + 0x39 (0x559abc67ec29 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
frame #40: __libc_start_main + 0xe7 (0x7f0a3fb46c87 in /lib/x86_64-linux-gnu/libc.so.6)
frame #41: <unknown function> + 0x1f9ad7 (0x559abc621ad7 in /home/dofirst/miniconda3/envs/huggingface/bin/python)
    torch.distributed.barrier()
  File "/home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2776, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: [2] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from recvBytes at /opt/conda/conda-bld/pytorch_1646755903507/work/torch/csrc/distributed/c10d/Utils.hpp:580 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f768dda91bd in /home/dofirst/miniconda3/envs/huggingface/lib/python3.8/site-packages/torch/lib/libc10.so)
I succeeded in pre-training without DDP. The tokenizer run finished fine, and I could then use the cached data with DDP after tokenizing. I don't know the cause yet, but this problem seems to be related to DDP.
My English is not that great; nevertheless, I want to solve this problem.
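For reference, the caching that makes this workaround possible is the standard datasets behaviour: map() writes its result to on-disk cache files keyed by a fingerprint of the function and its parameters, and any later run that applies the identical preprocessing (including one launched under DDP) loads the cached result instead of re-tokenizing. A minimal sketch, with placeholder file names and tokenizer:

from datasets import load_dataset
from transformers import AutoTokenizer

# First run (no DDP): the slow part. map() tokenizes and writes Arrow cache files to disk.
raw = load_dataset("text", data_files={"train": "train.txt"})
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize(batch):
    return tokenizer(batch["text"])

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Any later run that applies the same function with the same arguments (for example the
# DDP training run) reuses the cache, provided load_from_cache_file is left at its
# default of True, so the barrier in main_process_first() returns almost immediately.
tokenized_again = raw.map(tokenize, batched=True, remove_columns=["text"])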
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Looks like the process gets killed due to the torch.distributed.launch/run timeout of 30 minutes? (https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group)
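That 30-minute figure is the default timeout of torch.distributed.init_process_group, which bounds how long collectives such as barrier() will wait. A minimal sketch of the PyTorch-level knob (the Trainer normally calls this for you; the backend and value here are only illustrative):

from datetime import timedelta
import torch.distributed as dist

# Must run under a distributed launcher (torchrun / torch.distributed.launch) so that
# RANK, WORLD_SIZE and MASTER_ADDR/PORT are set. The default timeout is timedelta(minutes=30);
# a larger value gives rank 0 more time to finish preprocessing before the other ranks'
# barrier() call gives up with "Socket Timeout".
dist.init_process_group(backend="nccl", timeout=timedelta(hours=3))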
I had the same problem, where my job would be stopped when using DDP due to the long mapping/tokenization.
I have a similar task, and my torch.distributed.launch run gets interrupted due to the 30-minute timeout.
In my case, when I run the script normally (python run.py), the dataset gets cached, but when I run it with torch.distributed.launch the cache isn't used, the entire preprocessing step runs again, and it times out.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Re-opening as it doesn't seem like it's been solved. Maybe @sgugger could help here?
@patrickvonplaten To give you some clue about the potential root of the problem, given my experiment and #, I believe this happens when the script deals with extremely large-scale datasets. Mine was over 100GB, most of which came from the graph fields I had put in each example (in the parquet file). I managed to get past this by running the preprocessing on a single GPU first, and then using the cached files for fast loading in the multi-GPU setting.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hey guys, I'm having the same issue here when running in distributed mode (both with torch.distributed.launch and with elastic run); it seems to me like this isn't solved yet.
My system info:
- `transformers` version: 4.24.0
- Platform: Linux-4.15.0-166-generic
- Python version: 3.8.10
- PyTorch version (GPU?): 1.10.2+cu113 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script (Number)?: Yes (7-8 GPUs).
- Using distributed or parallel set-up in script?: Yes (run_clm.py script)
- Number of nodes in distributed: 1
My run information
- Modified scripts: My own modified script of run_clm.py, released in version 4.24.0.
- Dataset: openwebtext (from the hub)
Notes
- When using a smaller dataset (e.g. wikitext-2, wikitext-103), I don't have the issue.
- As mentioned above, the error appears after ~30 minutes; in my case, after 31:05 minutes.
Reproduction
I'm running the following:
torchrun \
--standalone \
--nnodes=1 \
--nproc_per_node=${NUM_GPU} \
./code/gpt2/Model-Compression-Research-Package/examples/transformers/language-modeling/run_clm.py \
--model_name_or_path ${MODEL} \
--dataset_name ${DS_NAME} \
--save_steps ${SAVE_STEPS} \
--logging_steps 1000 \
--eval_steps 2000 \
--do_train \
--do_eval \
--seed ${RANDOM} \
--max_steps ${MAX_TRAIN_STEPS} \
--learning_rate ${LR} \
--per_device_train_batch_size ${TRAIN_BATCH} \
--gradient_accumulation_steps ${ACC_STEPS} \
--per_device_eval_batch_size ${EVAL_BATCH} \
--evaluation_strategy steps \
--logging_dir ${OUTPUT_DIR} \
--output_dir ${OUTPUT_DIR} \
--overwrite_output_dir \
--load_best_model_at_end \
--max_train_samples 200 \
--max_eval_samples 200 \
And I'm having the following error output:
Traceback (most recent call last):
  File "./code/gpt2/Model-Compression-Research-Package/examples/transformers/language-modeling/run_clm.py", line 414, in main
    with training_args.main_process_first(desc="dataset map tokenization"):
  File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/venv/lib/python3.8/site-packages/transformers/training_args.py", line 1668, in main_process_first
    torch.distributed.barrier()
  File "/venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2709, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: Socket Timeout
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 5732 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 5733) of binary: /venv/bin/python3
Traceback (most recent call last):
  File "/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/venv/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./code/gpt2/Model-Compression-Research-Package/examples/transformers/language-modeling/run_clm.py FAILED
Failures:
  [1]: time: 2022-11-11_07:49:25  host: ido-branch-s2n2k-pxvf4  rank: 2 (local_rank: 2)  exitcode: 1 (pid: 5734)  error_file: <N/A>  traceback: To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
  [2]: time: 2022-11-11_07:49:25  host: ido-branch-s2n2k-pxvf4  rank: 3 (local_rank: 3)  exitcode: 1 (pid: 5735)  error_file: <N/A>  traceback: To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
  [3]: time: 2022-11-11_07:49:25  host: ido-branch-s2n2k-pxvf4  rank: 4 (local_rank: 4)  exitcode: 1 (pid: 5736)  error_file: <N/A>  traceback: To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
  [4]: time: 2022-11-11_07:49:25  host: ido-branch-s2n2k-pxvf4  rank: 5 (local_rank: 5)  exitcode: 1 (pid: 5737)  error_file: <N/A>  traceback: To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
  [5]: time: 2022-11-11_07:49:25  host: ido-branch-s2n2k-pxvf4  rank: 6 (local_rank: 6)  exitcode: 1 (pid: 5738)  error_file: <N/A>  traceback: To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
  [0]: time: 2022-11-11_07:49:25  host: ido-branch-s2n2k-pxvf4  rank: 1 (local_rank: 1)  exitcode: 1 (pid: 5733)  error_file: <N/A>  traceback: To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Thanks a lot in advance for looking into it
You are not using the `ddp_timeout` training argument to set a value higher than the 30-minute default, so if you have a big dataset to preprocess, you get this error. Use a bigger value to solve this, or preprocess your dataset in a non-distributed fashion.
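For example (a minimal sketch with a placeholder output directory; `ddp_timeout` is given in seconds and defaults to 1800, i.e. 30 minutes):

from transformers import TrainingArguments

# Raise the timeout used for the DDP process group so that a long first-time tokenization
# pass on the main process can finish before the other ranks' barrier() times out.
training_args = TrainingArguments(
    output_dir="out",
    do_train=True,
    ddp_timeout=7200,  # seconds
)

With the example scripts, the same setting should be available on the command line as --ddp_timeout 7200, since the scripts build their CLI from TrainingArguments via HfArgumentParser.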
@sgugger what if I am launching my script with the torch.distributed.launch utility? Then, even if I update ddp_timeout, it is not reflected and the processes halt after 30 minutes (the default time).
I met the same problem.
If you use torch.distributed.launch with a ddp_timeout that is not listened to, it sounds like a bug in PyTorch ;-)
> I met the same error. I tried to pre-train with a 25GB Korean corpus using example/run_clm.py. I haven't tested it in an environment without DDP yet, but I think this problem is related to the corpus size, because there was no problem with a smaller corpus. The process was killed at around 30000–32000 of 85249. The tokenizer type is byte-level BPE.
> I succeeded in pre-training without DDP. The tokenizer run finished fine, and I could then use the cached data with DDP after tokenizing. I don't know the cause yet, but this problem seems to be related to DDP.
> My English is not that great; nevertheless, I want to solve this problem.
Hi, I faced the same error as yours. It seems your solution is the only way to solve this problem. But how do you tokenize the data without DDP and then use the cached data for DDP training? When I use distributed training, it starts multiple processes from the beginning.