Save model checkpoint error when multi-gpu training still happens on 4.36.1
System Info
platform: Linux
python: 3.9
transformers: 4.36.1
running on a machine with two A10 GPUs (A10.2)
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)
Reproduction
I saw the release notes for 4.36.1 say this error was already fixed; however, it still occurs after I installed the latest version, running on a machine with two A10 GPUs.
Traceback (most recent call last):
2023-12-17 18:09:08 10.0.1.12: File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/runpy.py", line 197, in _run_module_as_main
2023-12-17 18:09:08 10.0.1.12: return _run_code(code, main_globals, None,
2023-12-17 18:09:08 10.0.1.12: File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/runpy.py", line 87, in _run_code
2023-12-17 18:09:08 10.0.1.12: exec(code, run_globals)
2023-12-17 18:09:08 10.0.1.12: File "/home/datascience/decompressed_artifact/code/src/axolotl/cli/train.py", line 38, in <module>
2023-12-17 18:09:08 10.0.1.12: fire.Fire(do_cli)
2023-12-17 18:09:08 10.0.1.12: File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
2023-12-17 18:09:08 10.0.1.12: component_trace = _Fire(component, args, parsed_flag_args, context, name)
2023-12-17 18:09:08 10.0.1.12: File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
2023-12-17 18:09:08 10.0.1.12: component, remaining_args = _CallAndUpdateTrace(
2023-12-17 18:09:08 10.0.1.12: File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
2023-12-17 18:09:08 10.0.1.12: component = fn(*varargs, **kwargs)
2023-12-17 18:09:08 10.0.1.12: File "/home/datascience/decompressed_artifact/code/src/axolotl/cli/train.py", line 34, in do_cli
2023-12-17 18:09:08 10.0.1.12: train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
2023-12-17 18:09:08 10.0.1.12: File "/home/datascience/decompressed_artifact/code/src/axolotl/train.py", line 126, in train
2023-12-17 18:09:08 10.0.1.12: trainer.train(resume_from_checkpoint=resume_from_checkpoint)
2023-12-17 18:09:08 10.0.1.12: File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/transformers/trainer.py", line 1537, in train
2023-12-17 18:09:08 10.0.1.12: return inner_training_loop(
2023-12-17 18:09:08 10.0.1.12: File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
2023-12-17 18:09:08 10.0.1.12: self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
2023-12-17 18:09:08 10.0.1.12: File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/transformers/trainer.py", line 2274, in _maybe_log_save_evaluate
2023-12-17 18:09:08 10.0.1.12: self._save_checkpoint(model, trial, metrics=metrics)
2023-12-17 18:09:08 10.0.1.12: File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/transformers/trainer.py", line 2376, in _save_checkpoint
2023-12-17 18:09:08 10.0.1.12: self.state.save_to_json(os.path.join(staging_output_dir, TRAINER_STATE_NAME))
2023-12-17 18:09:08 10.0.1.12: File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/transformers/trainer_callback.py", line 114, in save_to_json
2023-12-17 18:09:08 10.0.1.12: with open(json_path, "w", encoding="utf-8") as f:
2023-12-17 18:09:08 10.0.1.12: FileNotFoundError: [Errno 2] No such file or directory: './qlora-out/tmp-checkpoint-1080/trainer_state.json'
Expected behavior
I expect checkpoint saving to complete without errors.
Hi @z7ye, thanks for raising this issue!
Could you provide a minimal code snippet we can use to reproduce this error?
cc @muellerzr @pacman100
And please upgrade to 4.36.2
And please upgrade to 4.36.2
This problem occurs when training with multiple machines and multiple cards. 4.36.2 may not solve it either, since 4.36.1 already attempted to check for the existence of `staging_output_dir` on the main process.
Thanks, I'll look into this
And please upgrade to 4.36.2
This problem occurs when training with multiple machines and multiple cards. 4.36.2 may not solve it either, since 4.36.1 already attempted to check for the existence of `staging_output_dir` on the main process.
Yes, 4.36.2 also suffers from the same problem, even though #28078 has been updated.
https://github.com/huggingface/transformers/pull/27929#issuecomment-1853861756
This ad-hoc fix resolves the problem; it works in my case.
@ShaneTian or @hieu-blackbox can you please try pip install git+https://github.com/huggingface/transformers@muellerzr-multinode-save? It's an alternative we can try, as I agree the issue likely exists only when we don't have a shared file system.
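The `FileNotFoundError` in the original traceback is consistent with a rename race on a shared filesystem: every node's local main process promotes the same staging directory, so the first rename pulls it out from under the others. A minimal pure-Python sketch of that race (hypothetical paths, not the actual Trainer code):

```python
import os
import tempfile

# Simulate the shared checkpoint directory that all nodes see.
base = tempfile.mkdtemp()
staging = os.path.join(base, "tmp-checkpoint-1080")
final = os.path.join(base, "checkpoint-1080")

os.makedirs(staging)

# Node 0's main process finishes first and promotes the staging dir:
os.rename(staging, final)

# Node 1's main process, sharing the filesystem, still tries to write
# trainer_state.json into the (now missing) staging dir -- this is the
# exact failure mode shown in the traceback above:
try:
    with open(os.path.join(staging, "trainer_state.json"), "w") as f:
        f.write("{}")
except FileNotFoundError as e:
    print("race reproduced:", e)
```

With `save_on_each_node=True` (the default behavior being discussed), each node's local rank 0 runs the rename, which is safe on node-local disks but races on a shared filesystem.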
I see the error on 4.36.2 version as well, and I have a shared file system across each node. Using 2 nodes with 8 H100 gpus on each nodes.
Or could you try it? It's an alternative we can try, as I agree the issue likely exists only when we don't have a shared file system.
pip install git+https://github.com/huggingface/transformers@muellerzr-multinode-save
After updating the code, DeepSpeed starts the cluster and the node saves a checkpoint named tmp-checkpoint-10; on the host it becomes checkpoint-10. After checkpoint-10 is saved, a "Watchdog caught collective operation timeout" occurs and cluster training is interrupted.
48%| 10/21 [33:20<35:45, 195.01s/it]
/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
[2023-12-22 06:15:36,199] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /data1/liujifan/data/sft_out/tmp-checkpoint-10/global_step10/zero_pp_rank_8_mp_rank_00_model_states.pt...
[2023-12-22 06:15:39,569] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /data1/liujifan/data/sft_out/tmp-checkpoint-10/global_step10/zero_pp_rank_8_mp_rank_00_model_states.pt.
[2023-12-22 06:15:39,576] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /data1/liujifan/data/sft_out/tmp-checkpoint-10/global_step10/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt...
[2023-12-22 06:15:39,700] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /data1/liujifan/data/sft_out/tmp-checkpoint-10/global_step10/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt.
[2023-12-22 06:15:39,701] [INFO] [engine.py:3428:_save_zero_checkpoint] zero checkpoint saved /data1/liujifan/data/sft_out/tmp-checkpoint-10/global_step10/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt
[2023-12-22 06:15:39,764] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step10 is ready now!
[E ProcessGroupNCCL.cpp:475] [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800100 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800116 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800135 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800135 milliseconds before timing out.
a6000_node2:65506:493 [4] NCCL INFO [Service thread] Connection closed by localRank 4
a6000_node2:65509:495 [7] NCCL INFO [Service thread] Connection closed by localRank 7
a6000_node2:65505:498 [3] NCCL INFO [Service thread] Connection closed by localRank 3
a6000_node2:65506:465 [4] NCCL INFO comm 0xd875220 rank 12 nranks 16 cudaDev 4 busId 81000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 12] NCCL watchdog thread terminated with exception: [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800116 milliseconds before timing out.
a6000_node2:65503:496 [1] NCCL INFO [Service thread] Connection closed by localRank 1
[E ProcessGroupNCCL.cpp:475] [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800519 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800522 milliseconds before timing out.
a6000_node2:65507:494 [5] NCCL INFO [Service thread] Connection closed by localRank 5
a6000_node2:65508:500 [6] NCCL INFO [Service thread] Connection closed by localRank 6
a6000_node2:65508:447 [6] NCCL INFO comm 0xc320220 rank 14 nranks 16 cudaDev 6 busId c1000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 14] NCCL watchdog thread terminated with exception: [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800519 milliseconds before timing out.
a6000_node2:65505:452 [3] NCCL INFO comm 0xc021ee0 rank 11 nranks 16 cudaDev 3 busId 61000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 11] NCCL watchdog thread terminated with exception: [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800135 milliseconds before timing out.
a6000_node2:65509:459 [7] NCCL INFO comm 0xbc35500 rank 15 nranks 16 cudaDev 7 busId e1000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 15] NCCL watchdog thread terminated with exception: [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800100 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800721 milliseconds before timing out.
a6000_node2:65503:449 [1] NCCL INFO comm 0xce5ffe0 rank 9 nranks 16 cudaDev 1 busId 25000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 9] NCCL watchdog thread terminated with exception: [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800135 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800781 milliseconds before timing out.
a6000_node2:65504:497 [2] NCCL INFO [Service thread] Connection closed by localRank 2
a6000_node2:65502:499 [0] NCCL INFO [Service thread] Connection closed by localRank 0
a6000_node2:65504:454 [2] NCCL INFO comm 0xc9b2f80 rank 10 nranks 16 cudaDev 2 busId 41000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 10] NCCL watchdog thread terminated with exception: [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800721 milliseconds before timing out.
a6000_node2:65507:461 [5] NCCL INFO comm 0xbdb3600 rank 13 nranks 16 cudaDev 5 busId a1000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 13] NCCL watchdog thread terminated with exception: [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800522 milliseconds before timing out.
a6000_node2:65502:457 [0] NCCL INFO comm 0xc2d6f80 rank 8 nranks 16 cudaDev 0 busId 1000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 8] NCCL watchdog thread terminated with exception: [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800781 milliseconds before timing out.
[2023-12-22 06:45:43,272] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65502 closing signal SIGTERM
[2023-12-22 06:45:43,273] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65503 closing signal SIGTERM
[2023-12-22 06:45:43,273] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65504 closing signal SIGTERM
[2023-12-22 06:45:43,273] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65505 closing signal SIGTERM
[2023-12-22 06:45:43,273] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65507 closing signal SIGTERM
[2023-12-22 06:45:48,361] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 4 (pid: 65506) of binary: /root/anaconda3/envs/ljf_factory/bin/python
Traceback (most recent call last):
File "/root/anaconda3/envs/ljf_factory/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
src/train_bash.py FAILED
Failures:
[1]:
  time      : 2023-12-22_06:45:43
  host      : A6000_node2
  rank      : 14 (local_rank: 6)
  exitcode  : -6 (pid: 65508)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 65508
[2]:
  time      : 2023-12-22_06:45:43
  host      : A6000_node2
  rank      : 15 (local_rank: 7)
  exitcode  : -6 (pid: 65509)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 65509
Root Cause (first observed failure):
[0]:
  time      : 2023-12-22_06:45:43
  host      : A6000_node2
  rank      : 12 (local_rank: 4)
  exitcode  : -6 (pid: 65506)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 65506
Any update on this issue, please? I think 4.36.2 has the same issue.
Any update now? 4.36.2 definitely has the same issue! Which is the latest version that does not have this annoying bug?
Any update now? 4.36.2 definitely has the same issue! Which is the latest version that does not have this annoying bug?
The latest v4.37.1 still has the same issue in my case...
Gentle ping @muellerzr @pacman100
I just found that setting save_on_each_node=False in TrainingArguments works. See #28009
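For reference, `save_on_each_node` is passed through `TrainingArguments` (e.g. `TrainingArguments(output_dir="...", save_on_each_node=False)`). A minimal pure-Python sketch of the decision it controls, using a hypothetical helper rather than the real Trainer internals:

```python
def should_save_state(save_on_each_node: bool,
                      is_local_main: bool,
                      is_world_main: bool) -> bool:
    """Hypothetical helper mirroring the Trainer's choice: with
    save_on_each_node=True every node's local main process writes and
    cleans the checkpoint dir; with False only the global rank-0
    process does, avoiding the multi-node race on a shared filesystem."""
    return is_local_main if save_on_each_node else is_world_main

# Node 0, global rank 0 (local main and world main): always saves.
assert should_save_state(False, True, True) is True
# Node 1's local main process is skipped when save_on_each_node=False...
assert should_save_state(False, True, False) is False
# ...but would save (and potentially race) when save_on_each_node=True.
assert should_save_state(True, True, False) is True
```

The trade-off: `save_on_each_node=True` is needed when nodes do not share a filesystem, while `False` is the safe choice when they do.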
Also facing this issue on 4.36.2. Setting save_on_each_node=False allowed training to continue longer, but I still eventually hit an error like:
FileNotFoundError: [Errno 2] No such file or directory: './output/models/tmp-checkpoint-5970' -> './output/models/checkpoint-5970'
@JohnGiorgi can you give us more information on your setup please?
- Windows/Linux/Etc
- How many GPUs?
- Is it multi-node or single node (computer)
@muellerzr Linux (Ubuntu 22.04.2 LTS), multi-node with 4 nodes and 8 GPUs per node for a total of 32 GPUs (shared file-system and network). I will note that training progressed long enough to successfully save 1 checkpoint to disk, but failed when trying to write a second checkpoint some training steps later.
@muellerzr This problem seems to be resolved on the latest version of transformers (4.37.2)
This problem still exists in 4.38.1 with multi-node multi-GPU training
@muellerzr This problem seems to be resolved on the latest version of transformers (4.37.2)
It's not resolved.
@JohnGiorgi Try increasing the DDP timeout (--ddp_timeout 7200000) and see whether it works or not.
This problem still exists in 4.38.1 with multi-node multi-GPU training
I see the same problem with 4.38.1 (multi-gpu, single node)
In trainer.py line 2555,

```python
elif self.is_local_process_zero():
    # Clean up the remaining staging checkpoint folders on other nodes
    if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
        shutil.rmtree(staging_output_dir)
```

should change to

```python
elif self.is_local_process_zero() if self.args.save_on_each_node else self.is_world_process_zero():
    # Clean up the remaining staging checkpoint folders on other nodes
    if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
        shutil.rmtree(staging_output_dir, ignore_errors=True)
```
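To see why the `ignore_errors=True` part of the proposed change matters, here is a minimal sketch (hypothetical path, not the actual Trainer code): once another process has already renamed or removed the staging directory between the `os.path.exists` check and the cleanup, a plain `shutil.rmtree` on the stale path raises, while `ignore_errors=True` turns the cleanup into a harmless no-op.

```python
import os
import shutil
import tempfile

# A staging path that no longer exists, as after another process's rename.
missing = os.path.join(tempfile.mkdtemp(), "tmp-checkpoint-5970")

# With ignore_errors=True the cleanup silently does nothing:
shutil.rmtree(missing, ignore_errors=True)

# Without it, rmtree raises and would crash the training process:
try:
    shutil.rmtree(missing)
except FileNotFoundError as e:
    print("without ignore_errors:", type(e).__name__)
```

This mirrors the `'tmp-checkpoint-5970' -> 'checkpoint-5970'` failure reported above, where the exists-check passes on one process but the directory vanishes before it acts.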
This should be fine now on main; due to so many issues with the staging_dir, we've fully reverted it.
@Trangle It works for me. Thanks!
@muellerzr So how is this problem finally solved?
In trainer.py line 2555,

```python
elif self.is_local_process_zero():
    # Clean up the remaining staging checkpoint folders on other nodes
    if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
        shutil.rmtree(staging_output_dir)
```

should change to

```python
elif self.is_local_process_zero() if self.args.save_on_each_node else self.is_world_process_zero():
    # Clean up the remaining staging checkpoint folders on other nodes
    if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
        shutil.rmtree(staging_output_dir, ignore_errors=True)
```

Of course you can resolve the problem this way, but who will fix it in the main branch?
We've fully removed/reverted the staging dir logic, so this should be a nonissue now.
So which tag is the right one now?
nonissue
Which tag is usable? Has it been fixed?