stanford_alpaca
Multi-GPU error: torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0
When I use four GPUs to train the model, I get this error. Can anybody help me solve it? Thank you very much.
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
warnings.warn(
/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
warnings.warn(
/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
warnings.warn(
/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
warnings.warn(
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 77807 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 77808 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 77809 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 77806) of binary: /home/la/anaconda3/envs/alpaca_torch/bin/python
Traceback (most recent call last):
File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/torch/distributed/run.py", line 798, in <module>
main()
File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
train.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-30_20:18:47
host : guest-server
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 77806)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 77806
======================================================
When I run other code on multiple GPUs, I also get this error. Can anyone help?
Can you show the command you used to train in the multi-GPU environment?
python -m torch.distributed.run --nproc_per_node=4 --master_port=11110 train.py \
--model_name_or_path ./output/path \
--data_path ./alpaca_data.json \
--fp16 True \
--output_dir ./pretrained \
--num_train_epochs 3 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 False
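For context, exitcode -9 means the worker received SIGKILL, which on Linux is most often the kernel's out-of-memory (OOM) killer, typically because host RAM runs out while the checkpoint is being loaded. A quick check, assuming a Linux host where you can read the kernel log:

dmesg -T | grep -i -E "out of memory|killed process"   # OOM-killer entries around the crash time
free -h                                                # watch host RAM while the model is loading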
Same problem, have you found the solution?
Same problem, hoping for an answer.
Please try installing the specified version of transformers:
pip install git+https://github.com/zphang/transformers.git
cd transformers
git --reset hard 68d640f7c368bcaaaecfc678f11908ebbd3d6176
python setup.py install
Thank you for your advice. There is no setup.py in that transformers repository, only a README.md, so I cannot install transformers.
pip install git+https://github.com/zphang/transformers.git
cd transformers
git --reset hard 68d640f7c368bcaaaecfc678f11908ebbd3d6176
pip install .
You can try it; it solved the problem for me.
Thank you for your advice. I cannot run "pip install git+https://github.com/zphang/transformers.git"; I get this error:
Collecting git+https://github.com/zphang/transformers.git
Cloning https://github.com/zphang/transformers.git to /tmp/pip-req-build-8bfk9e3m
Running command git clone --quiet https://github.com/zphang/transformers.git /tmp/pip-req-build-8bfk9e3m
Resolved https://github.com/zphang/transformers.git to commit 63a9d6745f679b2eb882e0f147828380981111fa
ERROR: git+https://github.com/zphang/transformers.git does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found.
I downloaded transformers from "https://github.com/zphang/transformers", ran "cd transformers" and then "git --reset hard 68d640f7c368bcaaaecfc678f11908ebbd3d6176", and got this error:
Unknown option: --reset
usage: git [--version] [--help] [-C <path>] [-c name=value]
[--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]
[-p | --paginate | --no-pager] [--no-replace-objects] [--bare]
[--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]
<command> [<args>]
Did I do something wrong?
Sorry, maybe my previous suggestion was wrong: your transformers repository is the wrong one. Please try the following; it is what I did:
git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout 0041be5
pip install .
If you read Chinese, you can also follow this link: https://zhuanlan.zhihu.com/p/618321077 . I followed its steps and succeeded.
Thank you very much. My problem was solved by following your suggestion.
I have another question. I get the same error when running other models, and this method does not fix it there. I guess the commit "0041be5" should be different when running other models (such as GLM130B). How do I find out which commit to use instead of "0041be5"?
I think you may need a different Python virtual environment to train each model. I don't know which version of transformers GLM130B needs, so you had better ask its developers or read its guide.
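A minimal sketch of that idea, assuming conda is available; the transformers revision below is a hypothetical placeholder that you would replace with whatever the other model's documentation actually pins:

conda create -n other-model python=3.10 -y
conda activate other-model
# <required-commit-or-tag> is a placeholder; use the revision the other model's guide specifies
pip install "git+https://github.com/huggingface/transformers.git@<required-commit-or-tag>"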
I followed these steps, but it still does not work. Is there any other solution?
Can you tell me the machine configuration on which you successfully ran train.py? I hit the same problem but have no idea.
Can you show the error you got?
What exactly is the problem?
It always shows exitcode -9. My config: GPU V100 16G * 3, CPU RAM 128G. Is the RAM not enough? Thanks for your reply.
I have also been monitoring the RAM usage.
But I only see about 70% of the RAM being used in the background.
Oh, my friend, that is not the main reason; you should show me the exception printed above it. And your RAM is enough; my machine has less than yours.
Have you ever tried this:
git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout 0041be5
pip install .
maybe it works.
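As a sanity check (not part of the original suggestion), you can confirm that the pinned build is really the one Python imports:

python -c "import transformers; print(transformers.__version__, transformers.__file__)"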
Hi, I have tried this method but I still get this problem. Do you have any idea about it? The version of transformers I used is 4.29.0.dev0. Thanks in advance!
2023-04-26 07:19:29.474990: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[2023-04-26 07:19:35,696] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-04-26 07:19:53,218] [INFO] [partition_parameters.py:454:__exit__] finished initializing model with 6.74B parameters
Loading checkpoint shards: 100% 33/33 [01:08<00:00, 2.09s/it]
Using pad_token, but it is not set yet.
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
Using /root/.cache/torch_extensions/py39_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.060922384262085 seconds
Using /root/.cache/torch_extensions/py39_cu118 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py39_cu118/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.49018430709838867 seconds
Parameter Offload: Total persistent parameters: 0 in 0 params
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 9574) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
/content/drive/MyDrive/codealpaca/train.py FAILED
-----------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-04-26_07:22:29
host : 56de1ccd4f0e
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 9574)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 9574
=====================================================
Have you ever tried this:
git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout 0041be5
pip install .
maybe it works.
Thanks for your reply; I have followed your zhihu post step by step.
My transformers version:
(llmenv3) [xlwu@mochinelearning transformers]$ git checkout 0041be5
HEAD is now at 0041be5b3 LLaMA Implementation (#21955)
Then I run the train script:
TORCH_CPP_LOG_LEVEL=INFO NCCL_DEBUG=INFO LOGLEVEL=INFO CUDA_VISIBLE_DEVICES=0,1,2 torchrun \
--nproc_per_node=3 \
--master_port=25001 train.py \
--model_name_or_path /DATA/cdisk/xlwu_workspace/pretrain_model/hf-llama-model/llama-7b \
--data_path /DATA/cdisk/xlwu_workspace/data/test.json \
--output_dir /DATA/cdisk/xlwu_workspace/output/alpaca/sft_7b \
--per_device_eval_batch_size 1 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--report_to "tensorboard" \
--gradient_checkpointing True \
--fp16 True \
--deepspeed ds_config.json
The error shows:
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : train.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 3
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:25001
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 0
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_6we5ifkl/none_yi1eb8vm
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
[I socket.cpp:522] [c10d] The server socket has started to listen on [::]:25001.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:22302.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:22304.
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=25001
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2]
role_ranks=[0, 1, 2]
global_ranks=[0, 1, 2]
role_world_sizes=[3, 3, 3]
global_world_sizes=[3, 3, 3]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_6we5ifkl/none_yi1eb8vm/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_6we5ifkl/none_yi1eb8vm/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_6we5ifkl/none_yi1eb8vm/attempt_0/2/error.json
[2023-04-26 16:16:46,603] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29566.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29568.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29570.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29572.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29574.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29576.
[I ProcessGroupNCCL.cpp:751] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:587] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:587] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 1] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:587] [Rank 2] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 2] NCCL watchdog thread started!
mochinelearning:3293412:3293412 [0] NCCL INFO Bootstrap : Using eth0:10.1.118.59<0>
mochinelearning:3293412:3293412 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
mochinelearning:3293412:3293412 [0] NCCL INFO NET/IB : No device found.
mochinelearning:3293412:3293412 [0] NCCL INFO NET/Socket : Using [0]eth0:10.1.118.59<0> [1]br-e7526d7e228d:192.168.16.1<0> [2]br-1e30049ed87b:192.168.32.1<0> [3]br-100d386c1c63:192.168.48.1<0>
mochinelearning:3293412:3293412 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
mochinelearning:3293413:3293413 [1] NCCL INFO Bootstrap : Using eth0:10.1.118.59<0>
mochinelearning:3293414:3293414 [2] NCCL INFO Bootstrap : Using eth0:10.1.118.59<0>
mochinelearning:3293414:3293414 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
mochinelearning:3293413:3293413 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
mochinelearning:3293414:3293414 [2] NCCL INFO NET/IB : No device found.
mochinelearning:3293413:3293413 [1] NCCL INFO NET/IB : No device found.
mochinelearning:3293414:3293414 [2] NCCL INFO NET/Socket : Using [0]eth0:10.1.118.59<0> [1]br-e7526d7e228d:192.168.16.1<0> [2]br-1e30049ed87b:192.168.32.1<0> [3]br-100d386c1c63:192.168.48.1<0>
mochinelearning:3293413:3293413 [1] NCCL INFO NET/Socket : Using [0]eth0:10.1.118.59<0> [1]br-e7526d7e228d:192.168.16.1<0> [2]br-1e30049ed87b:192.168.32.1<0> [3]br-100d386c1c63:192.168.48.1<0>
mochinelearning:3293414:3293414 [2] NCCL INFO Using network Socket
mochinelearning:3293413:3293413 [1] NCCL INFO Using network Socket
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 00/04 : 0 1 2
mochinelearning:3293414:3293480 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] 1/-1/-1->2->-1 [2] -1/-1/-1->2->1 [3] 1/-1/-1->2->-1
mochinelearning:3293413:3293481 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 0/-1/-1->1->2 [2] 2/-1/-1->1->0 [3] 0/-1/-1->1->2
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 01/04 : 0 2 1
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 02/04 : 0 1 2
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 03/04 : 0 2 1
mochinelearning:3293412:3293479 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 00 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 00 : 2[a0] -> 0[80] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 00 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 02 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 02 : 2[a0] -> 0[80] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 02 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 01 : 0[80] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 01 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 01 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 03 : 0[80] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 03 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 03 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293412:3293479 [0] NCCL INFO Connected all rings
mochinelearning:3293414:3293480 [2] NCCL INFO Connected all rings
mochinelearning:3293413:3293481 [1] NCCL INFO Connected all rings
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 01 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 03 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 01 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 03 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 00 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 02 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 00 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 02 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Connected all trees
mochinelearning:3293414:3293480 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293414:3293480 [2] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293412:3293479 [0] NCCL INFO Connected all trees
mochinelearning:3293412:3293479 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293412:3293479 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293413:3293481 [1] NCCL INFO Connected all trees
mochinelearning:3293413:3293481 [1] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293413:3293481 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293414:3293480 [2] NCCL INFO comm 0x7fe1ec002fb0 rank 2 nranks 3 cudaDev 2 busId a0 - Init COMPLETE
mochinelearning:3293413:3293481 [1] NCCL INFO comm 0x7fda68002fb0 rank 1 nranks 3 cudaDev 1 busId 90 - Init COMPLETE
mochinelearning:3293412:3293479 [0] NCCL INFO comm 0x7fe718002fb0 rank 0 nranks 3 cudaDev 0 busId 80 - Init COMPLETE
[I ProcessGroupNCCL.cpp:1196] NCCL_DEBUG: INFO
mochinelearning:3293412:3293412 [0] NCCL INFO Launch mode Parallel
[2023-04-26 16:16:56,205] [INFO] [partition_parameters.py:413:__exit__] finished initializing model with 6.74B parameters
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 33/33 [01:38<00:00, 2.98s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 33/33 [01:38<00:00, 2.98s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 33/33 [01:38<00:00, 2.98s/it]
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Loading data...
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
[I ProcessGroupNCCL.cpp:587] [Rank 2] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 2] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:587] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 1] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:587] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 0] NCCL watchdog thread started!
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 00/04 : 0 1 2
mochinelearning:3293413:3293877 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 0/-1/-1->1->2 [2] 2/-1/-1->1->0 [3] 0/-1/-1->1->2
mochinelearning:3293414:3293876 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] 1/-1/-1->2->-1 [2] -1/-1/-1->2->1 [3] 1/-1/-1->2->-1
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 01/04 : 0 2 1
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 02/04 : 0 1 2
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 03/04 : 0 2 1
mochinelearning:3293412:3293875 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 00 : 2[a0] -> 0[80] via P2P/IPC
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 00 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 00 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 02 : 2[a0] -> 0[80] via P2P/IPC
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 02 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 02 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 01 : 0[80] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 01 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 01 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 03 : 0[80] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 03 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 03 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Connected all rings
mochinelearning:3293412:3293875 [0] NCCL INFO Connected all rings
mochinelearning:3293413:3293877 [1] NCCL INFO Connected all rings
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 01 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 03 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 01 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 03 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 00 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 02 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 00 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 02 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Connected all trees
mochinelearning:3293414:3293876 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293414:3293876 [2] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293412:3293875 [0] NCCL INFO Connected all trees
mochinelearning:3293412:3293875 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293412:3293875 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293413:3293877 [1] NCCL INFO Connected all trees
mochinelearning:3293413:3293877 [1] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293413:3293877 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293412:3293875 [0] NCCL INFO comm 0x7fe59c002fb0 rank 0 nranks 3 cudaDev 0 busId 80 - Init COMPLETE
mochinelearning:3293414:3293876 [2] NCCL INFO comm 0x7fe064002fb0 rank 2 nranks 3 cudaDev 2 busId a0 - Init COMPLETE
mochinelearning:3293413:3293877 [1] NCCL INFO comm 0x7fd8e0002fb0 rank 1 nranks 3 cudaDev 1 busId 90 - Init COMPLETE
[I ProcessGroupNCCL.cpp:1196] NCCL_DEBUG: INFO
mochinelearning:3293412:3293412 [0] NCCL INFO Launch mode Parallel
Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Creating extension directory /home/xlwu/.cache/torch_extensions/py310_cu113/cpu_adam...
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/xlwu/.cache/torch_extensions/py310_cu113/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/TH -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/TH -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512__ -c /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -lcurand -L/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 30.544737577438354 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 30.598078966140747 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 30.598520278930664 seconds
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Creating extension directory /home/xlwu/.cache/torch_extensions/py310_cu113/utils...
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Emitting ninja build file /home/xlwu/.cache/torch_extensions/py310_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/TH -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/THC -isystem /DATA/xlwu/anconda3/envs/llmenv3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o
[2/2] c++ flatten_unflatten.o -shared -L/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 15.1830472946167 seconds
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 15.121537685394287 seconds
Time to load utils op: 15.222201108932495 seconds
Parameter Offload: Total persistent parameters: 0 in 0 params
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3293412 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3293414 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 3293413) of binary: /DATA/xlwu/anconda3/envs/llmenv3/bin/python
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0019960403442382812 seconds
INFO:torch.distributed.elastic.multiprocessing.errors:local_rank 1 FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html
Traceback (most recent call last):
File "/DATA/xlwu/anconda3/envs/llmenv3/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
train.py FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-04-26_16:19:53
host : mochinelearning
rank : 1 (local_rank: 1)
exitcode : -9 (pid: 3293413)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 3293413
========================================================
Your specific error is: "Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination".
You should make sure that your CUDA and torch versions match.
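To double-check which versions are in play, a quick sketch (assuming nvcc is on the PATH):

python -c "import torch; print(torch.__version__, torch.version.cuda)"   # CUDA version torch was built against
nvcc --version                                                           # locally installed CUDA toolkit
nvidia-smi                                                               # driver and the highest CUDA version it supports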
My transformers version is 4.28.0.dev0. Maybe something went wrong in your steps; please check them.
If everyone could communicate in Chinese, it might reduce some of the communication overhead...
This error might occur when there is not enough RAM available. It can happen when using FSDP with multiple processes and the transformers from_pretrained method, where each process loads the checkpoint; as a result, the memory usage becomes num_processes * (model_size + size_of_largest_shard), leading to process crashes.
To tackle this issue, we can use DeepSpeed instead of FSDP. DeepSpeed optimizes CPU memory usage at initialization, and it only uses num_processes * size_of_largest_shard of RAM.
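If the repository provides a DeepSpeed config (for example under configs/), prefer that; otherwise, a minimal ZeRO-3 CPU-offload config might look like the sketch below. The file name and the "auto" values are assumptions to verify against the DeepSpeed and transformers documentation, not a definitive setup:

cat > ds_config.json << 'EOF'
{
  "fp16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto"
}
EOF
# Launch train.py with --deepspeed ds_config.json instead of the --fsdp flags,
# as in the torchrun command shown earlier in this thread.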
The error shows:
Traceback (most recent call last):
File "tools/train.py", line 194, in
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17013 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17014 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17015 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 3 (pid: 17016) of binary: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python
Traceback (most recent call last):
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
tools/train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2023-05-14_16:36:57
host : user-SYS-7049GP-TRT
rank : 3 (local_rank: 3)
exitcode : -6 (pid: 17016)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 17016
Hello, I am now getting this error. Do you know what might be causing it? My transformers version is 4.28.0.dev0.
[2023-06-01 09:44:26,442] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-06-01 09:44:38,504] [INFO] [partition_parameters.py:454:__exit__] finished initializing model with 6.74B parameters
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:50<00:00, 1.53s/it]
Using pad_token, but it is not set yet.
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.2360477447509766 seconds
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py38_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.3748812675476074 seconds
Parameter Offload: Total persistent parameters: 0 in 0 params
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 308887) of binary: /opt/conda/bin/python3.8
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
train.py FAILED
-------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-06-01_09:47:02
host : alpaca-6655dbbbc6-btc9j
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 308887)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 308887
=======================================================