Reminder
- [X] I have read the README and searched the existing issues.
System Info
Version v0.8.3
Reproduction
FORCE_TORCHRUN=1 NNODES=2 RANK=0 MASTER_ADDR=192.168.0.1 MASTER_PORT=29500 llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
FORCE_TORCHRUN=1 NNODES=2 RANK=1 MASTER_ADDR=192.168.0.1 MASTER_PORT=29500 llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
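For reference, a minimal sketch of repeating the same launch with NCCL/Gloo socket diagnostics enabled; the interface name eth0 is an assumption and should be replaced with the NIC that actually carries traffic between the two nodes:

```bash
# Sketch of a diagnostic re-run. NCCL_DEBUG, NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME
# are standard NCCL/Gloo environment variables; eth0 is a placeholder interface name.
export NCCL_DEBUG=INFO            # print NCCL setup and communication logs
export NCCL_SOCKET_IFNAME=eth0    # assumption: bind NCCL to the inter-node NIC
export GLOO_SOCKET_IFNAME=eth0    # assumption: same interface for the rendezvous store

FORCE_TORCHRUN=1 NNODES=2 RANK=0 MASTER_ADDR=192.168.0.1 MASTER_PORT=29500 \
  llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
```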
The following errors are raised.
Error on the master node:
(llamafactory) linewell@aifs3-master-1:/datas/work/hecheng/LLaMA-Factory1/LLaMA-Factory$ FORCE_TORCHRUN=1 NNODES=2 RANK=0 MASTER_ADDR=192.168.175.4 MASTER_PORT=29500 llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
[2024-09-16 17:38:36,034] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
09/16/2024 17:38:40 - INFO - llamafactory.cli - Initializing distributed tasks at: 192.168.175.4:29500
W0916 17:38:41.930000 139955215721408 torch/distributed/run.py:757]
W0916 17:38:41.930000 139955215721408 torch/distributed/run.py:757] *****************************************
W0916 17:38:41.930000 139955215721408 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0916 17:38:41.930000 139955215721408 torch/distributed/run.py:757] *****************************************
Traceback (most recent call last):
File "/home/linewell/miniforge3/envs/llamafactory/bin/torchrun", line 8, in
sys.exit(main())
^^^^^^
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent
result = agent.run()
^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run
result = self._invoke_run(role)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 870, in _invoke_run
self._initialize_workers(self._worker_group)
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 705, in _initialize_workers
self._rendezvous(worker_group)
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 551, in _rendezvous
workers = self._assign_worker_ranks(store, group_rank, group_world_size, spec)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 638, in _assign_worker_ranks
role_infos = self._share_and_gather(store, group_rank, group_world_size, spec)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 675, in _share_and_gather
role_infos_bytes = store_util.synchronize(
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
agent_data = get_all(store, rank, key_prefix, world_size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
data = store.get(f"{prefix}{idx}")
^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistStoreError: Socket Timeout
Error on the worker node:
(llama-factory) linewell@aifs3-worker-1:/datas/work/hecheng/LLaMA-Factory$ FORCE_TORCHRUN=1 NNODES=2 RANK=1 MASTER_ADDR=192.168.175.4 MASTER_PORT=29500 llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
[2024-09-16 17:39:19,896] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
09/16/2024 17:39:23 - INFO - llamafactory.cli - Initializing distributed tasks at: 192.168.175.4:29500
W0916 17:39:24.555000 139914969891776 torch/distributed/run.py:779]
W0916 17:39:24.555000 139914969891776 torch/distributed/run.py:779] *****************************************
W0916 17:39:24.555000 139914969891776 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0916 17:39:24.555000 139914969891776 torch/distributed/run.py:779] *****************************************
[rank1]:[E916 17:44:13.419203922 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=616, OpType=ALLREDUCE, NumelIn=17891328, NumelOut=17891328, Timeout(ms)=600000) ran for 600079 milliseconds before timing out.
[rank1]:[E916 17:44:13.423659979 ProcessGroupNCCL.cpp:1664] [PG 1 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 616, last enqueued NCCL work: 616, last completed NCCL work: 615.
[rank1]: Traceback (most recent call last):
[rank1]: File "/datas/work/hecheng/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank1]: launch()
[rank1]: File "/datas/work/hecheng/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank1]: run_exp()
[rank1]: File "/datas/work/hecheng/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank1]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank1]: File "/datas/work/hecheng/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 94, in run_sft
[rank1]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
[rank1]: return inner_training_loop(
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
[rank1]: tr_loss_step = self.training_step(model, inputs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/transformers/trainer.py", line 3349, in training_step
[rank1]: self.accelerator.backward(loss, **kwargs)
[rank1]: File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/accelerate/accelerator.py", line 2188, in backward
[rank1]: self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank1]: File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 166, in backward
[rank1]: self.engine.backward(loss, **kwargs)
[rank1]: File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank1]: ret_val = func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2046, in backward
[rank1]: self.allreduce_gradients()
[rank1]: File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank1]: ret_val = func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1965, in allreduce_gradients
[rank1]: self.optimizer.overlapping_partition_gradients_reduce_epilogue()
[rank1]: File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 868, in overlapping_partition_gradients_reduce_epilogue
[rank1]: self.independent_gradient_partition_epilogue()
[rank1]: File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 769, in independent_gradient_partition_epilogue
[rank1]: get_accelerator().synchronize()
[rank1]: File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/deepspeed/accelerator/cuda_accelerator.py", line 79, in synchronize
[rank1]: return torch.cuda.synchronize(device_index)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/cuda/__init__.py", line 892, in synchronize
[rank1]: return torch._C._cuda_synchronize()
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: KeyboardInterrupt
[rank1]:[E916 17:44:14.559958087 ProcessGroupNCCL.cpp:1709] [PG 1 Rank 1] Timeout at NCCL work: 616, last enqueued NCCL work: 616, last completed NCCL work: 615.
[rank1]:[E916 17:44:14.560003542 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E916 17:44:14.560015443 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E916 17:44:14.562376704 ProcessGroupNCCL.cpp:1515] [PG 1 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=616, OpType=ALLREDUCE, NumelIn=17891328, NumelOut=17891328, Timeout(ms)=600000) ran for 600079 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fab52ee9f86 in /home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fab541e68f2 in /home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fab541ed333 in /home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fab541ef71c in /home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xbd6df (0x7faba19396df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x76db (0x7faba39776db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x3f (0x7faba2efb61f in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 1 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=616, OpType=ALLREDUCE, NumelIn=17891328, NumelOut=17891328, Timeout(ms)=600000) ran for 600079 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fab52ee9f86 in /home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fab541e68f2 in /home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fab541ed333 in /home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fab541ef71c in /home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xbd6df (0x7faba19396df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x76db (0x7faba39776db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x3f (0x7faba2efb61f in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fab52ee9f86 in /home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: + 0xe5aa84 (0x7fab53e78a84 in /home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xbd6df (0x7faba19396df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x76db (0x7faba39776db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x3f (0x7faba2efb61f in /lib/x86_64-linux-gnu/libc.so.6)
Traceback (most recent call last):
File "/home/linewell/miniforge3/envs/llama-factory/bin/torchrun", line 8, in
sys.exit(main())
^^^^^^
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
result = agent.run()
^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 680, in run
result = self._invoke_run(role)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 829, in _invoke_run
self._initialize_workers(self._worker_group)
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 652, in _initialize_workers
self._rendezvous(worker_group)
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 502, in _rendezvous
workers = self._assign_worker_ranks(store, group_rank, group_world_size, spec)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 616, in _assign_worker_ranks
store.get(f"{ASSIGNED_RANKS_PREFIX}{group_rank}")
torch.distributed.DistNetworkError: Connection reset by peer
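Since the master fails during rendezvous with a socket timeout and the worker later hits "Connection reset by peer", a basic connectivity check between the two nodes may help narrow this down. A rough sketch, run from the worker node and using the master address from the logs above:

```bash
# Sketch: basic reachability checks from the worker to the master rendezvous endpoint.
ping -c 3 192.168.175.4      # can the worker reach the master node at all?
nc -zv 192.168.175.4 29500   # is the MASTER_PORT used by torchrun open and reachable?
```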
Expected behavior
The multi-node training runs normally.
Others
Runs normally.