
[BUG]: torch.distributed.elastic.multiprocessing.errors.ChildFailedError

Open MikeDean2367 opened this issue 2 years ago • 0 comments

🐛 Describe the bug

The script I ran is gpt/gemini/run_gemini.sh on 2 GPUs; the rest of the code is unchanged. The model I used is gpt2_7b with TP=1:

def gpt2_7b(checkpoint=True):
    return GPTLMModel(hidden_size=4096, num_layers=35, num_attention_heads=16, checkpoint=checkpoint)
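For context, a rough parameter-count estimate for this config explains the "7.5B" the demo reports later in the log. This is a back-of-envelope sketch using the standard transformer accounting; the ~50k vocab size is an assumption taken from the usual GPT-2 tokenizer, not from the script:

```python
# Rough parameter count for hidden_size=4096, num_layers=35.
hidden, layers, vocab, seq = 4096, 35, 50257, 1024

per_layer = 12 * hidden ** 2          # attention (4 h^2) + MLP (8 h^2)
embeddings = (vocab + seq) * hidden   # token + position embeddings
total = layers * per_layer + embeddings

print(f"{total / 1e9:.2f} B parameters")  # ~7.3 B, close to the logged 7.5B
```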

The output below shows the error. At the moment it occurred, memory usage was already at 100%; the server has 128 GB of RAM. I don't know why the failure happens at the 3rd step.

ColossalAI/examples/language/gpt/gemini$ bash run_gemini.sh 
+ export DISTPLAN=CAI_Gemini
+ DISTPLAN=CAI_Gemini
+ export CUDA_VISIBLE_DEVICES=0,1
+ CUDA_VISIBLE_DEVICES=0,1
+ export GPUNUM=2
+ GPUNUM=2
+ export TPDEGREE=1
+ TPDEGREE=1
+ export PLACEMENT=cpu
+ PLACEMENT=cpu
+ export USE_SHARD_INIT=False
+ USE_SHARD_INIT=False
+ export BATCH_SIZE=16
+ BATCH_SIZE=16
+ export MODEL_TYPE=gpt2_7b
+ MODEL_TYPE=gpt2_7b
+ export TRAIN_STEP=1000
+ TRAIN_STEP=1000
+ '[' False = True ']'
+ USE_SHARD_INIT=
+ mkdir -p gemini_logs
+ torchrun --standalone --nproc_per_node=2 ./train_gpt_demo.py --tp_degree=1 --model_type=gpt2_7b --batch_size=16 --placement=cpu --distplan=CAI_Gemini --train_step=1000
+ tee ./gemini_logs/gpt2_7b_CAI_Gemini_gpu_2_bs_16_tp_1_cpu.log
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
environmental variable OMP_NUM_THREADS is set to 40.
environmental variable OMP_NUM_THREADS is set to 40.
[03/17/23 20:00:13] INFO     colossalai - colossalai - INFO: /data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device                              
[03/17/23 20:00:13] INFO     colossalai - colossalai - INFO: /data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device                              
                    INFO     colossalai - colossalai - INFO: process rank 0 is bound to device 0                                                                                                         
                    INFO     colossalai - colossalai - INFO: process rank 1 is bound to device 1                                                                                                         
[03/17/23 20:00:18] INFO     colossalai - colossalai - INFO: /data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/colossalai/context/parallel_context.py:557 set_seed                                
[03/17/23 20:00:18] INFO     colossalai - colossalai - INFO: /data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/colossalai/context/parallel_context.py:557 set_seed                                
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed  
                             is ParallelMode.DATA.                                                                                                                                                       
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed  
                             is ParallelMode.DATA.                                                                                                                                                       
                    INFO     colossalai - colossalai - INFO: /data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/colossalai/initialize.py:116 launch                                                
                    INFO     colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 2, pipeline parallel size: 1, tensor parallel size: 1                           
                    INFO     colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:211 main                                                     
                    INFO     colossalai - colossalai - INFO: gpt2_7b, CAI_Gemini, batch size 16                                                                                                          
[extension] OP colossalai._C.cpu_adam has been compileed ahead of time, skip building.
[extension] OP colossalai._C.fused_optim has been compileed ahead of time, skip building.
searching chunk configuration is completed in 0.89 s.
used number: 6922.10 MB, wasted number: 60.29 MB
total wasted percentage is 0.86%
[03/17/23 20:01:54] INFO     colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:270 main                                                     
                    INFO     colossalai - colossalai - INFO: After init optim, GPU memory usage: 35.02 MB, CPU memory usage: 43108.87 MB                                                                 
                    INFO     colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:285 main                                                     
                    INFO     colossalai - colossalai - INFO: the size of testing model size is 7.5B.                                                                                                     
                    INFO     colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:286 main                                                     
                    INFO     colossalai - colossalai - INFO: After init model, GPU memory usage: 35.02 MB, CPU memory usage: 43108.89 MB                                                                 
[03/17/23 20:01:54] INFO     colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:285 main                                                     
                    INFO     colossalai - colossalai - INFO: the size of testing model size is 7.5B.                                                                                                     
[03/17/23 20:02:02] INFO     colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:308 train_step                                               
                    INFO     colossalai - colossalai - INFO: [1/1000] Forward GPU memory usage: 8367.72 MB, CPU memory usage: 44092.62 MB                                                                
[03/17/23 20:02:20] INFO     colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:320 train_step                                               
                    INFO     colossalai - colossalai - INFO: [1/1000] Backward GPU memory usage: 1605.80 MB, CPU memory usage: 44096.31 MB                                                               
                    INFO     colossalai - colossalai - INFO: /data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/colossalai/nn/optimizer/zero_optimizer.py:218 step                                 
[03/17/23 20:02:20] INFO     colossalai - colossalai - INFO: /data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/colossalai/nn/optimizer/zero_optimizer.py:218 step                                 
                    INFO     colossalai - colossalai - INFO: Found overflow. Skip step                                                                                                                   
                    INFO     colossalai - colossalai - INFO: Found overflow. Skip step                                                                                                                   
                    INFO     colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:326 train_step                                               
                    INFO     colossalai - colossalai - INFO: [1/1000] Optimizer step GPU memory usage: 1605.80 MB, CPU memory usage: 44096.32 MB                                                         
                    INFO     colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:329 train_step                                               
                    INFO     colossalai - colossalai - INFO: [1/1000] Loss:11.617, Step time: 25.792s, TFLOPS: 37.932, FWD time: 7.712s, BWD time: 18.063s, OPTIM time: 0.017s                           
[03/17/23 20:02:28] INFO     colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:308 train_step                                               
                    INFO     colossalai - colossalai - INFO: [2/1000] Forward GPU memory usage: 8367.72 MB, CPU memory usage: 44096.42 MB                                                                
[03/17/23 20:02:46] INFO     colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:320 train_step                                               
                    INFO     colossalai - colossalai - INFO: [2/1000] Backward GPU memory usage: 1605.80 MB, CPU memory usage: 44096.64 MB                                                               
                    INFO     colossalai - colossalai - INFO: /data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/colossalai/nn/optimizer/zero_optimizer.py:218 step                                 
                    INFO     colossalai - colossalai - INFO: Found overflow. Skip step                                                                                                                   
[03/17/23 20:02:46] INFO     colossalai - colossalai - INFO: /data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/colossalai/nn/optimizer/zero_optimizer.py:218 step                                 
                    INFO     colossalai - colossalai - INFO: Found overflow. Skip step                                                                                                                   
                    INFO     colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:326 train_step                                               
                    INFO     colossalai - colossalai - INFO: [2/1000] Optimizer step GPU memory usage: 1605.80 MB, CPU memory usage: 44096.93 MB                                                         
                    INFO     colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:329 train_step                                               
                    INFO     colossalai - colossalai - INFO: [2/1000] Loss:11.633, Step time: 25.643s, TFLOPS: 38.153, FWD time: 7.591s, BWD time: 18.030s, OPTIM time: 0.022s                           
[03/17/23 20:02:54] INFO     colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:308 train_step                                               
                    INFO     colossalai - colossalai - INFO: [3/1000] Forward GPU memory usage: 8367.72 MB, CPU memory usage: 44096.98 MB                                                                
[03/17/23 20:03:12] INFO     colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:320 train_step                                               
                    INFO     colossalai - colossalai - INFO: [3/1000] Backward GPU memory usage: 1605.80 MB, CPU memory usage: 44097.03 MB                                                               
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2410413 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 2410412) of binary: /data/zjt/anaconda3/envs/cai/bin/python
Traceback (most recent call last):
  File "/data/zjt/anaconda3/envs/cai/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
========================================================
./train_gpt_demo.py FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-17_20:04:46
  host      : code
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 2410412)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 2410412
========================================================
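Exit code -9 means the rank-0 process received SIGKILL, which on Linux is almost always the kernel OOM killer reclaiming memory (an oom-kill entry in `dmesg` would confirm this). A back-of-envelope estimate, using the usual mixed-precision/Adam accounting rather than anything measured from ColossalAI, shows why 128 GB of host RAM is borderline for a 7.5 B-parameter model with `placement=cpu`:

```python
# CPU-side state for ZeRO/Gemini-style training with CPU placement:
# fp32 master weights plus Adam momentum/variance, on top of fp16
# params and grads. Byte counts are the standard accounting, not
# measured values.
params = 7.5e9

fp32_master = 4 * params   # master weights
adam_m      = 4 * params   # Adam first moment
adam_v      = 4 * params   # Adam second moment
fp16_params = 2 * params
fp16_grads  = 2 * params

total_gb = (fp32_master + adam_m + adam_v + fp16_params + fp16_grads) / 2**30
print(f"~{total_gb:.0f} GB")  # ~112 GB, before activations, chunk
                              # buffers, or the OS — against a 128 GB host
```

This leaves very little headroom, so a gradual rise in resident memory (fragmentation, pinned buffers, logging) over the first few steps can push the host past its limit and trigger the OOM kill at step 3.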

Environment

pytorch==1.12.0 cuda==11.6 python==3.9

MikeDean2367 avatar Mar 17 '23 13:03 MikeDean2367