Save model checkpoint error during multi-GPU training still happens on 4.36.1

Open z7ye opened this issue 6 months ago • 32 comments

System Info

platform: linux
python: 3.9
transformers: 4.36.1
running on two A10.2

Who can help?

No response

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [x] My own task or dataset (give details below)

Reproduction

The release notes for 4.36.1 say this error has already been fixed; however, it still occurs after I install the latest version and run on a two-A10.2 machine.

                                                 Traceback (most recent call last):
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/runpy.py", line 197, in _run_module_as_main
2023-12-17 18:09:08 10.0.1.12:     return _run_code(code, main_globals, None,
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/runpy.py", line 87, in _run_code
2023-12-17 18:09:08 10.0.1.12:     exec(code, run_globals)
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/decompressed_artifact/code/src/axolotl/cli/train.py", line 38, in <module>
2023-12-17 18:09:08 10.0.1.12:     fire.Fire(do_cli)
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
2023-12-17 18:09:08 10.0.1.12:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
2023-12-17 18:09:08 10.0.1.12:     component, remaining_args = _CallAndUpdateTrace(
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
2023-12-17 18:09:08 10.0.1.12:     component = fn(*varargs, **kwargs)
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/decompressed_artifact/code/src/axolotl/cli/train.py", line 34, in do_cli
2023-12-17 18:09:08 10.0.1.12:     train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/decompressed_artifact/code/src/axolotl/train.py", line 126, in train
2023-12-17 18:09:08 10.0.1.12:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/transformers/trainer.py", line 1537, in train
2023-12-17 18:09:08 10.0.1.12:     return inner_training_loop(
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
2023-12-17 18:09:08 10.0.1.12:     self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/transformers/trainer.py", line 2274, in _maybe_log_save_evaluate
2023-12-17 18:09:08 10.0.1.12:     self._save_checkpoint(model, trial, metrics=metrics)
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/transformers/trainer.py", line 2376, in _save_checkpoint
2023-12-17 18:09:08 10.0.1.12:     self.state.save_to_json(os.path.join(staging_output_dir, TRAINER_STATE_NAME))
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/transformers/trainer_callback.py", line 114, in save_to_json
2023-12-17 18:09:08 10.0.1.12:     with open(json_path, "w", encoding="utf-8") as f:
2023-12-17 18:09:08 10.0.1.12: FileNotFoundError: [Errno 2] No such file or directory: './qlora-out/tmp-checkpoint-1080/trainer_state.json'

Expected behavior

I expect checkpoint saving to work during multi-GPU training.

z7ye avatar Dec 18 '23 18:12 z7ye

Hi @z7ye, thanks for raising this issue!

Could you provide a minimal code snippet we can use to reproduce this error?

cc @muellerzr @pacman100

amyeroberts avatar Dec 18 '23 18:12 amyeroberts

And please upgrade to 4.36.2

muellerzr avatar Dec 18 '23 19:12 muellerzr

And please upgrade to 4.36.2

This problem occurs when training with multiple machines and multiple cards. 4.36.2 probably does not solve it either, since 4.36.1 already tried checking for the presence of "staging_output_dir" on the "main_process".

Trangle avatar Dec 20 '23 12:12 Trangle

Thanks, I'll look into this

muellerzr avatar Dec 20 '23 12:12 muellerzr

And please upgrade to 4.36.2

This problem occurs when training with multiple machines and multiple cards. 4.36.2 probably does not solve it either, since 4.36.1 already tried checking for the presence of "staging_output_dir" on the "main_process".

Yes, 4.36.2 also suffers from the same problem, even though #28078 has been updated.

ShaneTian avatar Dec 20 '23 14:12 ShaneTian

https://github.com/huggingface/transformers/pull/27929#issuecomment-1853861756

This ad-hoc workaround fixes the problem; it works in my case.

hieu-blackbox avatar Dec 20 '23 18:12 hieu-blackbox

@ShaneTian or @hieu-blackbox can you please try pip install git+https://github.com/huggingface/transformers@muellerzr-multinode-save? It's an alternative we can try, as I agree the issue likely only exists when we don't have a shared file system.

muellerzr avatar Dec 21 '23 16:12 muellerzr

I see the error on the 4.36.2 version as well, and I have a shared file system across each node. I am using 2 nodes with 8 H100 GPUs on each node.

imraviagrawal avatar Dec 22 '23 19:12 imraviagrawal

Or could you try this? It's an alternative we can try, as I agree the issue likely only exists when we don't have a shared file system. pip install git+https://github.com/huggingface/transformers@muellerzr-multinode-save

After updating the code, DeepSpeed starts the cluster and the worker node saves the checkpoint as tmp-checkpoint-10, while the host has checkpoint-10. After checkpoint-10 is saved, a "Watchdog caught collective operation timeout" error occurs and the cluster training is interrupted:

48%|████████████████████████████████████████████████████████████████████████████████▍ | 10/21 [33:20<35:45, 195.01s/it]
/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details. warnings.warn(
/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details. warnings.warn(
/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details. warnings.warn(
/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details. warnings.warn(
/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details. warnings.warn(
/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details. warnings.warn(
/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details. warnings.warn(
/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details. warnings.warn(
[2023-12-22 06:15:36,199] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /data1/liujifan/data/sft_out/tmp-checkpoint-10/global_step10/zero_pp_rank_8_mp_rank_00_model_states.pt...
[2023-12-22 06:15:39,569] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /data1/liujifan/data/sft_out/tmp-checkpoint-10/global_step10/zero_pp_rank_8_mp_rank_00_model_states.pt.
[2023-12-22 06:15:39,576] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /data1/liujifan/data/sft_out/tmp-checkpoint-10/global_step10/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt...
[2023-12-22 06:15:39,700] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /data1/liujifan/data/sft_out/tmp-checkpoint-10/global_step10/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt.
[2023-12-22 06:15:39,701] [INFO] [engine.py:3428:_save_zero_checkpoint] zero checkpoint saved /data1/liujifan/data/sft_out/tmp-checkpoint-10/global_step10/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt [2023-12-22 06:15:39,764] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step10 is ready now! [E ProcessGroupNCCL.cpp:475] [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800100 milliseconds before timing out. [E ProcessGroupNCCL.cpp:475] [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800116 milliseconds before timing out. [E ProcessGroupNCCL.cpp:475] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800135 milliseconds before timing out. [E ProcessGroupNCCL.cpp:475] [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800135 milliseconds before timing out. a6000_node2:65506:493 [4] NCCL INFO [Service thread] Connection closed by localRank 4 a6000_node2:65509:495 [7] NCCL INFO [Service thread] Connection closed by localRank 7 a6000_node2:65505:498 [3] NCCL INFO [Service thread] Connection closed by localRank 3 a6000_node2:65506:465 [4] NCCL INFO comm 0xd875220 rank 12 nranks 16 cudaDev 4 busId 81000 - Abort COMPLETE [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down. [E ProcessGroupNCCL.cpp:916] [Rank 12] NCCL watchdog thread terminated with exception: [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800116 milliseconds before timing out. a6000_node2:65503:496 [1] NCCL INFO [Service thread] Connection closed by localRank 1 [E ProcessGroupNCCL.cpp:475] [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800519 milliseconds before timing out. [E ProcessGroupNCCL.cpp:475] [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800522 milliseconds before timing out. a6000_node2:65507:494 [5] NCCL INFO [Service thread] Connection closed by localRank 5 a6000_node2:65508:500 [6] NCCL INFO [Service thread] Connection closed by localRank 6 a6000_node2:65508:447 [6] NCCL INFO comm 0xc320220 rank 14 nranks 16 cudaDev 6 busId c1000 - Abort COMPLETE [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down. [E ProcessGroupNCCL.cpp:916] [Rank 14] NCCL watchdog thread terminated with exception: [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800519 milliseconds before timing out. 
a6000_node2:65505:452 [3] NCCL INFO comm 0xc021ee0 rank 11 nranks 16 cudaDev 3 busId 61000 - Abort COMPLETE [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down. [E ProcessGroupNCCL.cpp:916] [Rank 11] NCCL watchdog thread terminated with exception: [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800135 milliseconds before timing out. a6000_node2:65509:459 [7] NCCL INFO comm 0xbc35500 rank 15 nranks 16 cudaDev 7 busId e1000 - Abort COMPLETE [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down. [E ProcessGroupNCCL.cpp:916] [Rank 15] NCCL watchdog thread terminated with exception: [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800100 milliseconds before timing out. [E ProcessGroupNCCL.cpp:475] [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800721 milliseconds before timing out. a6000_node2:65503:449 [1] NCCL INFO comm 0xce5ffe0 rank 9 nranks 16 cudaDev 1 busId 25000 - Abort COMPLETE [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down. [E ProcessGroupNCCL.cpp:916] [Rank 9] NCCL watchdog thread terminated with exception: [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800135 milliseconds before timing out. [E ProcessGroupNCCL.cpp:475] [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800781 milliseconds before timing out. a6000_node2:65504:497 [2] NCCL INFO [Service thread] Connection closed by localRank 2 a6000_node2:65502:499 [0] NCCL INFO [Service thread] Connection closed by localRank 0 a6000_node2:65504:454 [2] NCCL INFO comm 0xc9b2f80 rank 10 nranks 16 cudaDev 2 busId 41000 - Abort COMPLETE [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down. [E ProcessGroupNCCL.cpp:916] [Rank 10] NCCL watchdog thread terminated with exception: [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800721 milliseconds before timing out. a6000_node2:65507:461 [5] NCCL INFO comm 0xbdb3600 rank 13 nranks 16 cudaDev 5 busId a1000 - Abort COMPLETE [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. 
Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down. [E ProcessGroupNCCL.cpp:916] [Rank 13] NCCL watchdog thread terminated with exception: [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800522 milliseconds before timing out. a6000_node2:65502:457 [0] NCCL INFO comm 0xc2d6f80 rank 8 nranks 16 cudaDev 0 busId 1000 - Abort COMPLETE [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down. [E ProcessGroupNCCL.cpp:916] [Rank 8] NCCL watchdog thread terminated with exception: [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21668, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800781 milliseconds before timing out. [2023-12-22 06:45:43,272] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65502 closing signal SIGTERM [2023-12-22 06:45:43,273] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65503 closing signal SIGTERM [2023-12-22 06:45:43,273] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65504 closing signal SIGTERM [2023-12-22 06:45:43,273] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65505 closing signal SIGTERM [2023-12-22 06:45:43,273] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65507 closing signal SIGTERM [2023-12-22 06:45:48,361] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 4 (pid: 65506) of binary: /root/anaconda3/envs/ljf_factory/bin/python Traceback (most recent call last): File "/root/anaconda3/envs/ljf_factory/bin/torchrun", line 8, in sys.exit(main()) File "/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper return f(*args, **kwargs) File "/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main run(args) File "/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run elastic_launch( File "/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call return launch_agent(self._config, self._entrypoint, list(args)) File "/root/anaconda3/envs/ljf_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

src/train_bash.py FAILED

Failures:
  [1]:
    time : 2023-12-22_06:45:43
    host : A6000_node2
    rank : 14 (local_rank: 6)
    exitcode : -6 (pid: 65508)
    error_file: <N/A>
    traceback : Signal 6 (SIGABRT) received by PID 65508
  [2]:
    time : 2023-12-22_06:45:43
    host : A6000_node2
    rank : 15 (local_rank: 7)
    exitcode : -6 (pid: 65509)
    error_file: <N/A>
    traceback : Signal 6 (SIGABRT) received by PID 65509

Root Cause (first observed failure):
  [0]:
    time : 2023-12-22_06:45:43
    host : A6000_node2
    rank : 12 (local_rank: 4)
    exitcode : -6 (pid: 65506)
    error_file: <N/A>
    traceback : Signal 6 (SIGABRT) received by PID 65506

vip-china avatar Dec 23 '23 02:12 vip-china

Any update on this issue, please? I think 4.36.2 has the same issue.

z7ye avatar Jan 19 '24 06:01 z7ye

Any update now? 4.36.2 definitely has the same issue! Which is the latest version that does not have this annoying bug?

mayiran1999 avatar Jan 25 '24 12:01 mayiran1999

Any update now? 4.36.2 definitely has the same issue! Which is the latest version that does not have this annoying bug?

Latest V4.37.1 still has the same issue in my case...

mayiran1999 avatar Jan 25 '24 13:01 mayiran1999

Gentle ping @muellerzr @pacman100

amyeroberts avatar Jan 25 '24 14:01 amyeroberts

I just found that setting save_on_each_node=False in TrainingArguments works. See #28009
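
For anyone trying this workaround, here is a minimal sketch of the TrainingArguments change (the output directory and save settings are illustrative values, not from the original report; only save_on_each_node is the relevant flag):

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="./qlora-out",   # illustrative path
        save_strategy="steps",
        save_steps=500,
        # With a shared filesystem there is no need for every node to write and
        # rename its own tmp-checkpoint-* folder; keep this False so only the
        # main process handles the checkpoint directory.
        save_on_each_node=False,
    )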

mayiran1999 avatar Jan 25 '24 14:01 mayiran1999

Also facing this issue on 4.36.2. Setting save_on_each_node=False allowed training to continue longer, but I still eventually hit an error like:

FileNotFoundError: [Errno 2] No such file or directory: './output/models/tmp-checkpoint-5970' -> './output/models/checkpoint-5970'

JohnGiorgi avatar Feb 07 '24 13:02 JohnGiorgi

@JohnGiorgi can you give us more information on your setup please?

  1. Windows/Linux/Etc
  2. How many GPUs?
  3. Is it multi-node or single node (one computer)?

muellerzr avatar Feb 07 '24 14:02 muellerzr

@muellerzr Linux (Ubuntu 22.04.2 LTS), multi-node with 4 nodes and 8 GPUs per node for a total of 32 GPUs (shared file-system and network). I will note that training progressed long enough to successfully save 1 checkpoint to disk, but failed when trying to write a second checkpoint some training steps later.

JohnGiorgi avatar Feb 07 '24 15:02 JohnGiorgi

@muellerzr This problem seems to be resolved on the latest version of transformers (4.37.2)

JohnGiorgi avatar Feb 09 '24 02:02 JohnGiorgi

This problem still exists in 4.38.1 with multi-node multi-GPU training.

voidmagic avatar Feb 26 '24 14:02 voidmagic

@muellerzr This problem seems to be resolved on the latest version of transformers (4.37.2)

It's not resolved.

sahilqure avatar Feb 27 '24 15:02 sahilqure

@JohnGiorgi Try increasing the DDP timeout, e.g. --ddp_timeout 7200000, and see whether it works or not.
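
If you prefer to set this in code rather than on the command line, a minimal sketch (the output directory is an illustrative value; ddp_timeout is given in seconds and defaults to 1800):

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="./output",  # illustrative path
        # ddp_timeout is passed to torch.distributed.init_process_group as the
        # collective timeout (in seconds); a larger value gives slow checkpoint
        # saves more headroom before the NCCL watchdog aborts the job.
        ddp_timeout=7200000,
    )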

sahilqure avatar Feb 27 '24 15:02 sahilqure

This problem still exists in 4.38.1 with multi-node multi-GPU training.

I see the same problem with 4.38.1 (multi-GPU, single node).

Ravisutha avatar Feb 27 '24 17:02 Ravisutha

In trainer.py, line 2555:

        elif self.is_local_process_zero():
            # Clean up the remaining staging checkpoint folders on other nodes
            if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
                shutil.rmtree(staging_output_dir)

should change to

        elif self.is_local_process_zero() if self.args.save_on_each_node else self.is_world_process_zero():
            # Clean up the remaining staging checkpoint folders on other nodes
            if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
                shutil.rmtree(staging_output_dir, ignore_errors=True)
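
To spell out the proposed condition, here is an equivalent standalone sketch (the helper name and the trainer argument are hypothetical, used only to show the logic outside the elif chain):

    import os
    import shutil

    def cleanup_staging_dir(trainer, staging_output_dir, output_dir):
        # Hypothetical helper mirroring the proposed patch; not a transformers API.
        # When each node writes its own checkpoint, the local main process of
        # every node cleans up; otherwise only the global main process does.
        is_cleaner = (
            trainer.is_local_process_zero()
            if trainer.args.save_on_each_node
            else trainer.is_world_process_zero()
        )
        if is_cleaner and staging_output_dir != output_dir and os.path.exists(staging_output_dir):
            # ignore_errors avoids crashing if another rank already removed the folder
            shutil.rmtree(staging_output_dir, ignore_errors=True)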

Trangle avatar Feb 29 '24 06:02 Trangle

This should be fine now on main; due to so many issues with the staging_dir, we've fully reverted it.

muellerzr avatar Mar 04 '24 14:03 muellerzr

@Trangle It works for me. Thanks!

MangoFF avatar Mar 15 '24 13:03 MangoFF

@muellerzr So how was this problem finally solved?

YinHan-Zhang avatar Mar 21 '24 14:03 YinHan-Zhang

In trainer.py, line 2555:

        elif self.is_local_process_zero():
            # Clean up the remaining staging checkpoint folders on other nodes
            if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
                shutil.rmtree(staging_output_dir)

should change to

        elif self.is_local_process_zero() if self.args.save_on_each_node else self.is_world_process_zero():
            # Clean up the remaining staging checkpoint folders on other nodes
            if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
                shutil.rmtree(staging_output_dir, ignore_errors=True)

Of course, that resolves the problem, but who will fix it in the main branch?

ldh127 avatar Apr 04 '24 09:04 ldh127

We've fully removed/reverted the staging dir logic, so this should be a nonissue now.

muellerzr avatar Apr 04 '24 13:04 muellerzr

In trainer.py, line 2555:

        elif self.is_local_process_zero():
            # Clean up the remaining staging checkpoint folders on other nodes
            if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
                shutil.rmtree(staging_output_dir)

should change to

        elif self.is_local_process_zero() if self.args.save_on_each_node else self.is_world_process_zero():
            # Clean up the remaining staging checkpoint folders on other nodes
            if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
                shutil.rmtree(staging_output_dir, ignore_errors=True)

Of course, that resolves the problem, but who will fix it in the main branch?

Which tag is the right one right now?

zheng5yu9 avatar Apr 15 '24 04:04 zheng5yu9

nonissue

Which tag is useful? Has it been fixed?

zheng5yu9 avatar Apr 15 '24 04:04 zheng5yu9