Save model checkpoint error during multi-GPU training still happens on 4.36.1

Open z7ye opened this issue 6 months ago • 32 comments

System Info

platform: linux
python: 3.9
transformers: 4.36.1
running on two A10.2

Who can help?

No response

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [x] My own task or dataset (give details below)

Reproduction

The release notes for 4.36.1 say this error has already been fixed; however, it still occurs after I install the latest version and run on a two-A10.2 machine.

                                                 Traceback (most recent call last):
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/runpy.py", line 197, in _run_module_as_main
2023-12-17 18:09:08 10.0.1.12:     return _run_code(code, main_globals, None,
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/runpy.py", line 87, in _run_code
2023-12-17 18:09:08 10.0.1.12:     exec(code, run_globals)
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/decompressed_artifact/code/src/axolotl/cli/train.py", line 38, in <module>
2023-12-17 18:09:08 10.0.1.12:     fire.Fire(do_cli)
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
2023-12-17 18:09:08 10.0.1.12:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
2023-12-17 18:09:08 10.0.1.12:     component, remaining_args = _CallAndUpdateTrace(
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
2023-12-17 18:09:08 10.0.1.12:     component = fn(*varargs, **kwargs)
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/decompressed_artifact/code/src/axolotl/cli/train.py", line 34, in do_cli
2023-12-17 18:09:08 10.0.1.12:     train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/decompressed_artifact/code/src/axolotl/train.py", line 126, in train
2023-12-17 18:09:08 10.0.1.12:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/transformers/trainer.py", line 1537, in train
2023-12-17 18:09:08 10.0.1.12:     return inner_training_loop(
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
2023-12-17 18:09:08 10.0.1.12:     self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/transformers/trainer.py", line 2274, in _maybe_log_save_evaluate
2023-12-17 18:09:08 10.0.1.12:     self._save_checkpoint(model, trial, metrics=metrics)
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/transformers/trainer.py", line 2376, in _save_checkpoint
2023-12-17 18:09:08 10.0.1.12:     self.state.save_to_json(os.path.join(staging_output_dir, TRAINER_STATE_NAME))
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/transformers/trainer_callback.py", line 114, in save_to_json
2023-12-17 18:09:08 10.0.1.12:     with open(json_path, "w", encoding="utf-8") as f:
2023-12-17 18:09:08 10.0.1.12: FileNotFoundError: [Errno 2] No such file or directory: './qlora-out/tmp-checkpoint-1080/trainer_state.json'

Expected behavior

I expect checkpoint saving to work during multi-GPU training.

z7ye avatar Dec 18 '23 18:12 z7ye

Hi @z7ye, thanks for raising this issue!

Could you provide a minimal code snippet we can use to reproduce this error?

cc @muellerzr @pacman100

amyeroberts avatar Dec 18 '23 18:12 amyeroberts

And please upgrade to 4.36.2

muellerzr avatar Dec 18 '23 19:12 muellerzr

And please upgrade to 4.36.2

This problem occurs when training with multiple machines and multiple cards. 4.36.2 probably does not solve it either, since 4.36.1 already tried checking for the presence of "staging_output_dir" on the "main_process".

Trangle avatar Dec 20 '23 12:12 Trangle

Thanks, I'll look into this

muellerzr avatar Dec 20 '23 12:12 muellerzr

And please upgrade to 4.36.2

This problem occurs when training with multiple machines and multiple cards. 4.36.2 probably does not solve it either, since 4.36.1 already tried checking for the presence of "staging_output_dir" on the "main_process".

Yes, 4.36.2 also suffers from the same problem, even though #28078 has been updated.

ShaneTian avatar Dec 20 '23 14:12 ShaneTian

https://github.com/huggingface/transformers/pull/27929#issuecomment-1853861756

This ad-hoc workaround fixes the problem; it works in my case.

hieu-blackbox avatar Dec 20 '23 18:12 hieu-blackbox

@ShaneTian or @hieu-blackbox can you please try pip install git+https://github.com/huggingface/transformers@muellerzr-multinode-save? It's an alternative we can try, as I agree the issue likely only exists when we don't have a shared file system.

muellerzr avatar Dec 21 '23 16:12 muellerzr

I see the error on the 4.36.2 version as well, and I have a shared file system across each node. I am using 2 nodes with 8 H100 GPUs on each node.

imraviagrawal avatar Dec 22 '23 19:12 imraviagrawal

Or could you try this? It's an alternative we can try, as I agree the issue likely only exists when we don't have a shared file system. pip install git+https://github.com/huggingface/transformers@muellerzr-multinode-save

After updating the code, DeepSpeed starts the cluster and the worker node saves the checkpoint as tmp-checkpoint-10, while the host has checkpoint-10. After checkpoint-10 is saved, a "Watchdog caught collective operation timeout" error occurs and the cluster training is interrupted:

48%|████████████████████████████████████████████████████████████████████████████████▍ | 10/21 [33:20<35:45, 195.01s/it]
/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details. warnings.warn(
/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details. warnings.warn(
/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details. warnings.warn(
/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details. warnings.warn(
/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details. warnings.warn(
/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details. warnings.warn(
/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details. warnings.warn(
/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details. warnings.warn(
[2023-12-22 06:15:36,199] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /data1/liujifan/data/sft_out/tmp-checkpoint-10/global_step10/zero_pp_rank_8_mp_rank_00_model_states.pt...
[2023-12-22 06:15:39,569] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /data1/liujifan/data/sft_out/tmp-checkpoint-10/global_step10/zero_pp_rank_8_mp_rank_00_model_states.pt.
[2023-12-22 06:15:39,576] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /data1/liujifan/data/sft_out/tmp-checkpoint-10/global_step10/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt...
[2023-12-22 06:15:39,700] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /data1/liujifan/data/sft_out/tmp-checkpoint-10/global_step10/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt.
[2023-12-22 06:15:39,701] [INFO] [engine.py:3428:_save_zero_checkpoint] zero checkpoint saved /data1/liujifan/data/sft_out/tmp-checkpoint-10/global_step10/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt [2023-12-22 06:15:39,764] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step10 is ready now! [E ProcessGroupNCCL.cpp:475] [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800100 milliseconds before timing out. [E ProcessGroupNCCL.cpp:475] [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800116 milliseconds before timing out. [E ProcessGroupNCCL.cpp:475] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800135 milliseconds before timing out. [E ProcessGroupNCCL.cpp:475] [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800135 milliseconds before timing out. a6000_node2:65506:493 [4] NCCL INFO [Service thread] Connection closed by localRank 4 a6000_node2:65509:495 [7] NCCL INFO [Service thread] Connection closed by localRank 7 a6000_node2:65505:498 [3] NCCL INFO [Service thread] Connection closed by localRank 3 a6000_node2:65506:465 [4] NCCL INFO comm 0xd875220 rank 12 nranks 16 cudaDev 4 busId 81000 - Abort COMPLETE [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down. [E ProcessGroupNCCL.cpp:916] [Rank 12] NCCL watchdog thread terminated with exception: [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800116 milliseconds before timing out. a6000_node2:65503:496 [1] NCCL INFO [Service thread] Connection closed by localRank 1 [E ProcessGroupNCCL.cpp:475] [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800519 milliseconds before timing out. [E ProcessGroupNCCL.cpp:475] [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800522 milliseconds before timing out. a6000_node2:65507:494 [5] NCCL INFO [Service thread] Connection closed by localRank 5 a6000_node2:65508:500 [6] NCCL INFO [Service thread] Connection closed by localRank 6 a6000_node2:65508:447 [6] NCCL INFO comm 0xc320220 rank 14 nranks 16 cudaDev 6 busId c1000 - Abort COMPLETE [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down. [E ProcessGroupNCCL.cpp:916] [Rank 14] NCCL watchdog thread terminated with exception: [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800519 milliseconds before timing out. 
a6000_node2:65505:452 [3] NCCL INFO comm 0xc021ee0 rank 11 nranks 16 cudaDev 3 busId 61000 - Abort COMPLETE [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down. [E ProcessGroupNCCL.cpp:916] [Rank 11] NCCL watchdog thread terminated with exception: [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800135 milliseconds before timing out. a6000_node2:65509:459 [7] NCCL INFO comm 0xbc35500 rank 15 nranks 16 cudaDev 7 busId e1000 - Abort COMPLETE [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down. [E ProcessGroupNCCL.cpp:916] [Rank 15] NCCL watchdog thread terminated with exception: [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800100 milliseconds before timing out. [E ProcessGroupNCCL.cpp:475] [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800721 milliseconds before timing out. a6000_node2:65503:449 [1] NCCL INFO comm 0xce5ffe0 rank 9 nranks 16 cudaDev 1 busId 25000 - Abort COMPLETE [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down. [E ProcessGroupNCCL.cpp:916] [Rank 9] NCCL watchdog thread terminated with exception: [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800135 milliseconds before timing out. [E ProcessGroupNCCL.cpp:475] [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800781 milliseconds before timing out. a6000_node2:65504:497 [2] NCCL INFO [Service thread] Connection closed by localRank 2 a6000_node2:65502:499 [0] NCCL INFO [Service thread] Connection closed by localRank 0 a6000_node2:65504:454 [2] NCCL INFO comm 0xc9b2f80 rank 10 nranks 16 cudaDev 2 busId 41000 - Abort COMPLETE [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down. [E ProcessGroupNCCL.cpp:916] [Rank 10] NCCL watchdog thread terminated with exception: [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800721 milliseconds before timing out. a6000_node2:65507:461 [5] NCCL INFO comm 0xbdb3600 rank 13 nranks 16 cudaDev 5 busId a1000 - Abort COMPLETE [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. 
Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down. [E ProcessGroupNCCL.cpp:916] [Rank 13] NCCL watchdog thread terminated with exception: [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800522 milliseconds before timing out. a6000_node2:65502:457 [0] NCCL INFO comm 0xc2d6f80 rank 8 nranks 16 cudaDev 0 busId 1000 - Abort COMPLETE [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down. [E ProcessGroupNCCL.cpp:916] [Rank 8] NCCL watchdog thread terminated with exception: [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800781 milliseconds before timing out. [2023-12-22 06:45:43,272] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65502 closing signal SIGTERM [2023-12-22 06:45:43,273] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65503 closing signal SIGTERM [2023-12-22 06:45:43,273] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65504 closing signal SIGTERM [2023-12-22 06:45:43,273] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65505 closing signal SIGTERM [2023-12-22 06:45:43,273] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65507 closing signal SIGTERM [2023-12-22 06:45:48,361] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 4 (pid: 65506) of binary: /root/anaconda3/envs/ljf_factory/bin/python Traceback (most recent call last): File "/root/anaconda3/envs/ljf_factory/bin/torchrun", line 8, in sys.exit(main()) File "/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper return f(*args, **kwargs) File "/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main run(args) File "/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run elastic_launch( File "/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call return launch_agent(self._config, self._entrypoint, list(args)) File "/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

src/train_bash.py FAILED

Failures:
  [1]:
    time : 2023-12-22_06:45:43
    host : A6000_node2
    rank : 14 (local_rank: 6)
    exitcode : -6 (pid: 65508)
    error_file: <N/A>
    traceback : Signal 6 (SIGABRT) received by PID 65508
  [2]:
    time : 2023-12-22_06:45:43
    host : A6000_node2
    rank : 15 (local_rank: 7)
    exitcode : -6 (pid: 65509)
    error_file: <N/A>
    traceback : Signal 6 (SIGABRT) received by PID 65509

Root Cause (first observed failure):
  [0]:
    time : 2023-12-22_06:45:43
    host : A6000_node2
    rank : 12 (local_rank: 4)
    exitcode : -6 (pid: 65506)
    error_file: <N/A>
    traceback : Signal 6 (SIGABRT) received by PID 65506

vip-china avatar Dec 23 '23 02:12 vip-china

Any update on this issue, please? I think 4.36.2 has the same issue.

z7ye avatar Jan 19 '24 06:01 z7ye

Any update now? 4.36.2 definitely has the same issue! Which is the latest version that does not have this annoying bug?

mayiran1999 avatar Jan 25 '24 12:01 mayiran1999

Any update now? 4.36.2 definitely has the same issue! Which is the latest version that does not have this annoying bug?

Latest V4.37.1 still has the same issue in my case...

mayiran1999 avatar Jan 25 '24 13:01 mayiran1999

Gentle ping @muellerzr @pacman100

amyeroberts avatar Jan 25 '24 14:01 amyeroberts

I just found that setting save_on_each_node=False in TrainingArguments works. See #28009
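
For anyone trying this workaround, here is a minimal sketch of the TrainingArguments change (the output directory and save settings are illustrative values, not from the original report; only save_on_each_node is the relevant flag):

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="./qlora-out",   # illustrative path
        save_strategy="steps",
        save_steps=500,
        # With a shared filesystem there is no need for every node to write and
        # rename its own tmp-checkpoint-* folder; keep this False so only the
        # main process handles the checkpoint directory.
        save_on_each_node=False,
    )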

mayiran1999 avatar Jan 25 '24 14:01 mayiran1999

Also facing this issue on 4.36.2. Setting save_on_each_node=False allowed training to continue longer, but I still eventually hit an error like:

FileNotFoundError: [Errno 2] No such file or directory: './output/models/tmp-checkpoint-5970' -> './output/models/checkpoint-5970'

JohnGiorgi avatar Feb 07 '24 13:02 JohnGiorgi

@JohnGiorgi can you give us more information on your setup please?

  1. Windows/Linux/Etc
  2. How many GPUs?
  3. Is it multi-node or single node (one computer)?

muellerzr avatar Feb 07 '24 14:02 muellerzr

@muellerzr Linux (Ubuntu 22.04.2 LTS), multi-node with 4 nodes and 8 GPUs per node for a total of 32 GPUs (shared file-system and network). I will note that training progressed long enough to successfully save 1 checkpoint to disk, but failed when trying to write a second checkpoint some training steps later.

JohnGiorgi avatar Feb 07 '24 15:02 JohnGiorgi

@muellerzr This problem seems to be resolved on the latest version of transformers (4.37.2)

JohnGiorgi avatar Feb 09 '24 02:02 JohnGiorgi

This problem still exists in 4.38.1 with multi-node multi-GPU training.

voidmagic avatar Feb 26 '24 14:02 voidmagic

@muellerzr This problem seems to be resolved on the latest version of transformers (4.37.2)

It's not resolved.

sahilqure avatar Feb 27 '24 15:02 sahilqure

@JohnGiorgi Try increasing the DDP timeout, e.g. --ddp_timeout 7200000, and see whether it works or not.
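
If you prefer to set this in code rather than on the command line, a minimal sketch (the output directory is an illustrative value; ddp_timeout is given in seconds and defaults to 1800):

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="./output",  # illustrative path
        # ddp_timeout is passed to torch.distributed.init_process_group as the
        # collective timeout (in seconds); a larger value gives slow checkpoint
        # saves more headroom before the NCCL watchdog aborts the job.
        ddp_timeout=7200000,
    )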

sahilqure avatar Feb 27 '24 15:02 sahilqure

This problem still exists in 4.38.1 with multi-node multi-GPU training.

I see the same problem with 4.38.1 (multi-GPU, single node).

Ravisutha avatar Feb 27 '24 17:02 Ravisutha

In trainer.py, line 2555:

        elif self.is_local_process_zero():
            # Clean up the remaining staging checkpoint folders on other nodes
            if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
                shutil.rmtree(staging_output_dir)

should change to

        elif self.is_local_process_zero() if self.args.save_on_each_node else self.is_world_process_zero():
            # Clean up the remaining staging checkpoint folders on other nodes
            if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
                shutil.rmtree(staging_output_dir, ignore_errors=True)
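
To spell out the proposed condition, here is an equivalent standalone sketch (the helper name and the trainer argument are hypothetical, used only to show the logic outside the elif chain):

    import os
    import shutil

    def cleanup_staging_dir(trainer, staging_output_dir, output_dir):
        # Hypothetical helper mirroring the proposed patch; not a transformers API.
        # When each node writes its own checkpoint, the local main process of
        # every node cleans up; otherwise only the global main process does.
        is_cleaner = (
            trainer.is_local_process_zero()
            if trainer.args.save_on_each_node
            else trainer.is_world_process_zero()
        )
        if is_cleaner and staging_output_dir != output_dir and os.path.exists(staging_output_dir):
            # ignore_errors avoids crashing if another rank already removed the folder
            shutil.rmtree(staging_output_dir, ignore_errors=True)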

Trangle avatar Feb 29 '24 06:02 Trangle

This should be fine now on main; due to so many issues with the staging_dir, we've fully reverted it.

muellerzr avatar Mar 04 '24 14:03 muellerzr

@Trangle It works for me. Thanks!

MangoFF avatar Mar 15 '24 13:03 MangoFF

@muellerzr So how was this problem finally solved?

YinHan-Zhang avatar Mar 21 '24 14:03 YinHan-Zhang

In trainer.py, line 2555:

        elif self.is_local_process_zero():
            # Clean up the remaining staging checkpoint folders on other nodes
            if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
                shutil.rmtree(staging_output_dir)

should change to

        elif self.is_local_process_zero() if self.args.save_on_each_node else self.is_world_process_zero():
            # Clean up the remaining staging checkpoint folders on other nodes
            if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
                shutil.rmtree(staging_output_dir, ignore_errors=True)

Of course, that resolves the problem, but who will fix it in the main branch?

ldh127 avatar Apr 04 '24 09:04 ldh127

We've fully removed/reverted the staging dir logic, so this should be a nonissue now.

muellerzr avatar Apr 04 '24 13:04 muellerzr

In trainer.py, line 2555:

        elif self.is_local_process_zero():
            # Clean up the remaining staging checkpoint folders on other nodes
            if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
                shutil.rmtree(staging_output_dir)

should change to

        elif self.is_local_process_zero() if self.args.save_on_each_node else self.is_world_process_zero():
            # Clean up the remaining staging checkpoint folders on other nodes
            if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
                shutil.rmtree(staging_output_dir, ignore_errors=True)

Of course, that resolves the problem, but who will fix it in the main branch?

Which tag is the right one right now?

zheng5yu9 avatar Apr 15 '24 04:04 zheng5yu9

nonissue

Which tag is useful? Has it been fixed?

zheng5yu9 avatar Apr 15 '24 04:04 zheng5yu9