[BUG]: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 3 (pid: 812917) of binary
🐛 Describe the bug
colossalai run --nproc_per_node=4 train_sft.py \
> --pretrain "/data/chenhao/train/ColossalAI/to/llama-7b-hf/" \
> --model 'llama' \
> --strategy colossalai_zero2 \
> --log_interval 10 \
> --save_path "/data/chenhao/train/ColossalAI/Coati-7B" \
> --dataset "/data/chenhao/train/ColossalAI/data.json" \
> --batch_size 4 \
> --accimulation_steps 8 \
> --lr 2e-5 \
> --max_datasets_size 512 \
> --max_epochs 1
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[04/10/23 16:14:26] INFO colossalai - colossalai - INFO: /data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/python3.10/site-packages/colossalai/context/parallel_context.py:522 set_device
[04/10/23 16:14:26] INFO colossalai - colossalai - INFO: /data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/python3.10/site-packages/colossalai/context/parallel_context.py:522 set_device
[04/10/23 16:14:26] INFO colossalai - colossalai - INFO: /data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/python3.10/site-packages/colossalai/context/parallel_context.py:522 set_device
INFO colossalai - colossalai - INFO: process rank 2 is bound to device 2
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
INFO colossalai - colossalai - INFO: process rank 1 is bound to device 1
[04/10/23 16:14:26] INFO colossalai - colossalai - INFO: /data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/python3.10/site-packages/colossalai/context/parallel_context.py:522 set_device
INFO colossalai - colossalai - INFO: process rank 3 is bound to device 3
[04/10/23 16:14:35] INFO colossalai - colossalai - INFO: /data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/python3.10/site-packages/colossalai/context/parallel_context.py:558 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 42, python random: 42, ParallelMode.DATA: 42, ParallelMode.TENSOR: 42, the default parallel seed is ParallelMode.DATA.
[04/10/23 16:14:35] INFO colossalai - colossalai - INFO: /data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/python3.10/site-packages/colossalai/context/parallel_context.py:558 set_seed
[04/10/23 16:14:35] INFO colossalai - colossalai - INFO: /data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/python3.10/site-packages/colossalai/context/parallel_context.py:558 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 2, numpy: 42, python random: 42, ParallelMode.DATA: 42, ParallelMode.TENSOR: 42, the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: initialized seed on rank 3, numpy: 42, python random: 42, ParallelMode.DATA: 42, ParallelMode.TENSOR: 42, the default parallel seed is ParallelMode.DATA.
[04/10/23 16:14:35] INFO colossalai - colossalai - INFO: /data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/python3.10/site-packages/colossalai/context/parallel_context.py:558 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 42, python random: 42, ParallelMode.DATA: 42, ParallelMode.TENSOR: 42, the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/python3.10/site-packages/colossalai/initialize.py:115 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 4, pipeline parallel size: 1, tensor parallel size: 1
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 812914 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 812915 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 812916 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 3 (pid: 812917) of binary: /data/chenhao/anaconda3/envs/ColossalAI-Chat/bin/python
Traceback (most recent call last):
  File "/data/chenhao/anaconda3/envs/ColossalAI-Chat/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
train_sft.py FAILED
-------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-04-10_16:15:28
host : xd-dev8
rank : 3 (local_rank: 3)
exitcode : -9 (pid: 812917)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 812917
=======================================================
Error: failed to run torchrun --nproc_per_node=4 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train_sft.py --pretrain /data/chenhao/train/ColossalAI/to/llama-7b-hf/ --model llama --strategy colossalai_zero2 --log_interval 10 --save_path /data/chenhao/train/ColossalAI/Coati-7B --dataset /data/chenhao/train/ColossalAI/data.json --batch_size 4 --accimulation_steps 8 --lr 2e-5 --max_datasets_size 512 --max_epochs 1 on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!
Command: 'cd /data/chenhao/codes/ColossalAI/applications/Chat/examples && export LC_PAPER="zh_CN.UTF-8" XDG_SESSION_ID="16846" LC_ADDRESS="zh_CN.UTF-8" LC_MONETARY="zh_CN.UTF-8" SHELL="/bin/bash" TERM="xterm-256color" SSH_CLIENT="10.248.30.236 60180 22" CONDA_SHLVL="3" CONDA_PROMPT_MODIFIER="(ColossalAI-Chat) " LC_NUMERIC="zh_CN.UTF-8" OLDPWD="/data/chenhao/codes/ColossalAI/applications/Chat" SSH_TTY="/dev/pts/52" USER="chenhao" LS_COLORS="rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:" LC_TELEPHONE="zh_CN.UTF-8" CONDA_EXE="/data/chenhao/anaconda3/bin/conda" CONDA_PREFIX_1="/data/chenhao/anaconda3" PATH="/data/chenhao/anaconda3/envs/ColossalAI-Chat/bin:/data/chenhao/anaconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin" MAIL="/var/mail/chenhao" CONDA_PREFIX_2="/data/chenhao/anaconda3/envs/ColossalAI" QT_QPA_PLATFORMTHEME="appmenu-qt5" CONDA_PREFIX="/data/chenhao/anaconda3/envs/ColossalAI-Chat" LC_IDENTIFICATION="zh_CN.UTF-8" PWD="/data/chenhao/codes/ColossalAI/applications/Chat/examples" LANG="en_US.UTF-8" LC_MEASUREMENT="zh_CN.UTF-8" HOME="/home/chenhao" SHLVL="1" CONDA_PYTHON_EXE="/data/chenhao/anaconda3/bin/python" LOGNAME="chenhao" SSH_CONNECTION="10.248.30.236 60180 10.248.33.108 22" XDG_DATA_DIRS="/usr/local/share:/usr/share:/var/lib/snapd/desktop" CONDA_DEFAULT_ENV="ColossalAI-Chat" LESSOPEN="| /usr/bin/lesspipe %s" XDG_RUNTIME_DIR="/run/user/990" LESSCLOSE="/usr/bin/lesspipe %s %s" LC_TIME="zh_CN.UTF-8" LC_NAME="zh_CN.UTF-8" _="/data/chenhao/anaconda3/envs/ColossalAI-Chat/bin/colossalai" && torchrun --nproc_per_node=4 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train_sft.py --pretrain /data/chenhao/train/ColossalAI/to/llama-7b-hf/ --model llama --strategy colossalai_zero2 --log_interval 10 --save_path /data/chenhao/train/ColossalAI/Coati-7B --dataset /data/chenhao/train/ColossalAI/data.json --batch_size 4 --accimulation_steps 8 --lr 2e-5 --max_datasets_size 512 --max_epochs 1'
Exit code: 1
Stdout: already printed
Stderr: already printed
====== Training on All Nodes =====
127.0.0.1: failure
====== Stopping All Nodes =====
127.0.0.1: finish
Environment
CUDA: 11.2, Python: 3.10, PyTorch: 1.13.1
GeForce RTX 2080 Ti × 4 (11 GB each)
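For scale, here is a rough back-of-envelope memory estimate for this hardware (a sketch only; it assumes LLaMA-7B weights in fp16 and that every rank materializes a full copy of the checkpoint in host RAM while loading):

# rough_mem_estimate.sh -- back-of-envelope numbers, not a measurement
PARAMS_B=7                                                                # ~7 billion parameters
echo "fp16 weights, one copy              : $((PARAMS_B * 2)) GB"         # 2 bytes per parameter
echo "host RAM if 4 ranks each load a copy: $((PARAMS_B * 2 * 4)) GB"     # before any sharding/offload
echo "VRAM available per GPU              : 11 GB (RTX 2080 Ti)"

Roughly 14 GB of fp16 weights already exceeds the 11 GB per card, and four ranks each loading the checkpoint can transiently demand on the order of 56 GB of host RAM, which is consistent with the SIGKILL (exit code -9) in the log rather than a CUDA out-of-memory error.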
I'm hitting a very similar error. Where is the solution? :(
Me too! When I run the code on 7 or 8 GPUs, it fails with the same error as yours, but when I run it on 6 GPUs it succeeds. I'm very puzzled.
Hi @twwch @Pe0p1e024 @HaixHan, have you checked the memory usage during training? Running out of CPU (host) memory can lead to this: exit code -9 means the process received SIGKILL, which is typically the kernel OOM killer. You can try allocating more main memory and rerunning it.
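If the kernel OOM killer is indeed terminating the failing rank, the following should confirm it (a sketch; dmesg may require root, and the polling intervals are arbitrary):

# Look for OOM-killer activity around the time the rank died
dmesg -T | grep -iE "out of memory|oom-killer|killed process" | tail -n 20
# Or, on systemd machines:
journalctl -k --since "1 hour ago" | grep -i oom
# Watch host RAM in a second terminal while relaunching the job
free -h -s 5
# Watch GPU memory for comparison
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 5

If host RAM is the limit, adding more main memory or swap space (as suggested above) is the most direct fix.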
I went through this process again and successfully ran a smaller base model. You can refer to my blog post for more details: ColossalAI-Chat训练手册(RLHF) (ColossalAI-Chat training manual for RLHF).