
[BUG]: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 3 (pid: 812917) of binary

Open · twwch opened this issue · 4 comments

🐛 Describe the bug

colossalai run --nproc_per_node=4 train_sft.py \
> --pretrain "/data/chenhao/train/ColossalAI/to/llama-7b-hf/" \
> --model 'llama' \
> --strategy colossalai_zero2 \
> --log_interval 10 \
> --save_path  "/data/chenhao/train/ColossalAI/Coati-7B" \
> --dataset "/data/chenhao/train/ColossalAI/data.json" \
> --batch_size 4 \
> --accimulation_steps 8 \
> --lr 2e-5 \
> --max_datasets_size 512 \
> --max_epochs 1 
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[04/10/23 16:14:26] INFO     colossalai - colossalai - INFO:                    
                             /data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/py
                             thon3.10/site-packages/colossalai/context/parallel_
                             context.py:522 set_device                          
[04/10/23 16:14:26] INFO     colossalai - colossalai - INFO:                    
                             /data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/py
                             thon3.10/site-packages/colossalai/context/parallel_
                             context.py:522 set_device                          
[04/10/23 16:14:26] INFO     colossalai - colossalai - INFO:                    
                             /data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/py
                             thon3.10/site-packages/colossalai/context/parallel_
                             context.py:522 set_device                          
                    INFO     colossalai - colossalai - INFO: process rank 2 is  
                             bound to device 2                                  
                    INFO     colossalai - colossalai - INFO: process rank 0 is  
                             bound to device 0                                  
                    INFO     colossalai - colossalai - INFO: process rank 1 is  
                             bound to device 1                                  
[04/10/23 16:14:26] INFO     colossalai - colossalai - INFO:                    
                             /data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/py
                             thon3.10/site-packages/colossalai/context/parallel_
                             context.py:522 set_device                          
                    INFO     colossalai - colossalai - INFO: process rank 3 is  
                             bound to device 3                                  
[04/10/23 16:14:35] INFO     colossalai - colossalai - INFO:                    
                             /data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/py
                             thon3.10/site-packages/colossalai/context/parallel_
                             context.py:558 set_seed                            
                    INFO     colossalai - colossalai - INFO: initialized seed on
                             rank 1, numpy: 42, python random: 42,              
                             ParallelMode.DATA: 42, ParallelMode.TENSOR: 42,the 
                             default parallel seed is ParallelMode.DATA.        
[04/10/23 16:14:35] INFO     colossalai - colossalai - INFO:                    
                             /data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/py
                             thon3.10/site-packages/colossalai/context/parallel_
                             context.py:558 set_seed                            
[04/10/23 16:14:35] INFO     colossalai - colossalai - INFO:                    
                             /data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/py
                             thon3.10/site-packages/colossalai/context/parallel_
                             context.py:558 set_seed                            
                    INFO     colossalai - colossalai - INFO: initialized seed on
                             rank 2, numpy: 42, python random: 42,              
                             ParallelMode.DATA: 42, ParallelMode.TENSOR: 42,the 
                             default parallel seed is ParallelMode.DATA.        
                    INFO     colossalai - colossalai - INFO: initialized seed on
                             rank 3, numpy: 42, python random: 42,              
                             ParallelMode.DATA: 42, ParallelMode.TENSOR: 42,the 
                             default parallel seed is ParallelMode.DATA.        
[04/10/23 16:14:35] INFO     colossalai - colossalai - INFO:                    
                             /data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/py
                             thon3.10/site-packages/colossalai/context/parallel_
                             context.py:558 set_seed                            
                    INFO     colossalai - colossalai - INFO: initialized seed on
                             rank 0, numpy: 42, python random: 42,              
                             ParallelMode.DATA: 42, ParallelMode.TENSOR: 42,the 
                             default parallel seed is ParallelMode.DATA.        
                    INFO     colossalai - colossalai - INFO:                    
                             /data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/py
                             thon3.10/site-packages/colossalai/initialize.py:115
                              launch                                            
                    INFO     colossalai - colossalai - INFO: Distributed        
                             environment is initialized, data parallel size: 4, 
                             pipeline parallel size: 1, tensor parallel size: 1 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 812914 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 812915 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 812916 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 3 (pid: 812917) of binary: /data/chenhao/anaconda3/envs/ColossalAI-Chat/bin/python
Traceback (most recent call last):
  File "/data/chenhao/anaconda3/envs/ColossalAI-Chat/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/chenhao/anaconda3/envs/ColossalAI-Chat/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
train_sft.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-10_16:15:28
  host      : xd-dev8
  rank      : 3 (local_rank: 3)
  exitcode  : -9 (pid: 812917)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 812917
=======================================================
Error: failed to run torchrun --nproc_per_node=4 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train_sft.py --pretrain /data/chenhao/train/ColossalAI/to/llama-7b-hf/ --model llama --strategy colossalai_zero2 --log_interval 10 --save_path /data/chenhao/train/ColossalAI/Coati-7B --dataset /data/chenhao/train/ColossalAI/data.json --batch_size 4 --accimulation_steps 8 --lr 2e-5 --max_datasets_size 512 --max_epochs 1 on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!

Command: 'cd /data/chenhao/codes/ColossalAI/applications/Chat/examples && export LC_PAPER="zh_CN.UTF-8" XDG_SESSION_ID="16846" LC_ADDRESS="zh_CN.UTF-8" LC_MONETARY="zh_CN.UTF-8" SHELL="/bin/bash" TERM="xterm-256color" SSH_CLIENT="10.248.30.236 60180 22" CONDA_SHLVL="3" CONDA_PROMPT_MODIFIER="(ColossalAI-Chat) " LC_NUMERIC="zh_CN.UTF-8" OLDPWD="/data/chenhao/codes/ColossalAI/applications/Chat" SSH_TTY="/dev/pts/52" USER="chenhao" LS_COLORS="rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:" LC_TELEPHONE="zh_CN.UTF-8" CONDA_EXE="/data/chenhao/anaconda3/bin/conda" CONDA_PREFIX_1="/data/chenhao/anaconda3" PATH="/data/chenhao/anaconda3/envs/ColossalAI-Chat/bin:/data/chenhao/anaconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin" MAIL="/var/mail/chenhao" CONDA_PREFIX_2="/data/chenhao/anaconda3/envs/ColossalAI" QT_QPA_PLATFORMTHEME="appmenu-qt5" CONDA_PREFIX="/data/chenhao/anaconda3/envs/ColossalAI-Chat" LC_IDENTIFICATION="zh_CN.UTF-8" PWD="/data/chenhao/codes/ColossalAI/applications/Chat/examples" LANG="en_US.UTF-8" LC_MEASUREMENT="zh_CN.UTF-8" HOME="/home/chenhao" SHLVL="1" CONDA_PYTHON_EXE="/data/chenhao/anaconda3/bin/python" LOGNAME="chenhao" SSH_CONNECTION="10.248.30.236 60180 10.248.33.108 22" XDG_DATA_DIRS="/usr/local/share:/usr/share:/var/lib/snapd/desktop" CONDA_DEFAULT_ENV="ColossalAI-Chat" LESSOPEN="| /usr/bin/lesspipe %s" XDG_RUNTIME_DIR="/run/user/990" LESSCLOSE="/usr/bin/lesspipe %s %s" LC_TIME="zh_CN.UTF-8" LC_NAME="zh_CN.UTF-8" _="/data/chenhao/anaconda3/envs/ColossalAI-Chat/bin/colossalai" && torchrun --nproc_per_node=4 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train_sft.py --pretrain /data/chenhao/train/ColossalAI/to/llama-7b-hf/ --model llama --strategy colossalai_zero2 --log_interval 10 --save_path /data/chenhao/train/ColossalAI/Coati-7B --dataset /data/chenhao/train/ColossalAI/data.json --batch_size 4 --accimulation_steps 8 --lr 2e-5 --max_datasets_size 512 --max_epochs 1'

Exit code: 1

Stdout: already printed

Stderr: already printed



====== Training on All Nodes =====
127.0.0.1: failure

====== Stopping All Nodes =====
127.0.0.1: finish

Environment

CUDA: 11.2, Python: 3.10, PyTorch: 1.13.1

GeForce RTX 2080 Ti × 4 (11 GB each)

twwch · Apr 10, 2023

I'm hitting a very similar error. Is there a solution? :(

Pe0p1e024 · Apr 12, 2023

Me too! When I run the code on 7 or 8 GPUs, I hit the same error as you, but when I run on 6 GPUs it succeeds. I am very puzzled.
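
Could it be a host-memory issue? A rough back-of-the-envelope calculation, assuming each spawned rank loads its own full fp16 copy of a 7B-parameter model into host RAM before anything is sharded or moved to GPU (just a guess about the loading path, not verified):

    7B params x 2 bytes (fp16) ≈ 14 GB of weights per rank
    6 ranks → ≈  84 GB of host RAM just for weights
    8 ranks → ≈ 112 GB of host RAM just for weights

On a machine with, say, around 100 GB of free RAM, 6 processes would fit but 7 or 8 would tip it over, and the kernel would SIGKILL one of the ranks, which is exactly what exitcode -9 means.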

HaixHan · Apr 12, 2023

Hi @twwch @Pe0p1e024 @HaixHan, have you checked the memory usage during training? Exit code -9 means the process was killed with SIGKILL, which often happens when the host runs out of CPU memory and the kernel's OOM killer steps in. You can try allocating more main memory and rerunning.
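
If you want to confirm it is the host OOM killer, here is a quick sketch of checks to run on the training machine (standard Linux tools; adjust for your distro):

    # did the kernel OOM-kill the rank? look for the dead PID (812917 in the log above)
    dmesg -T | grep -i -E 'killed process|out of memory'

    # on systemd hosts the kernel log is also available via journalctl
    journalctl -k | grep -i 'out of memory'

    # watch host memory live while the job is running
    watch -n 1 free -h

If the training PID shows up in an "Out of memory: Killed process ..." line, the usual workarounds are fewer ranks, a smaller model, or more main memory/swap.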

Camille7777 · Apr 20, 2023

I went through the process again and successfully trained a smaller base model. You can refer to my blog post for more information: ColossalAI-Chat训练手册(RLHF) (ColossalAI-Chat Training Manual, RLHF).
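
For anyone looking for a starting point, here is a sketch of what such a reduced run could look like. It reuses the flags from the command at the top of this issue, with --pretrain pointed at a smaller checkpoint; the paths below are placeholders:

    colossalai run --nproc_per_node=4 train_sft.py \
        --pretrain "/path/to/smaller-base-model" \
        --model 'llama' \
        --strategy colossalai_zero2 \
        --log_interval 10 \
        --save_path "/path/to/output-dir" \
        --dataset "/path/to/data.json" \
        --batch_size 4 \
        --accimulation_steps 8 \
        --lr 2e-5 \
        --max_datasets_size 512 \
        --max_epochs 1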

twwch · Apr 21, 2023