[BUG]: torch.distributed.elastic.multiprocessing.errors.ChildFailedError
🐛 Describe the bug
The script I run is gpt/gemini/run_gemini.sh on 2 GPUs. The rest of the code is unchanged. The model I used is gpt2_7b (GPT 7B), with TP=1:
def gpt2_7b(checkpoint=True):
    return GPTLMModel(hidden_size=4096, num_layers=35, num_attention_heads=16, checkpoint=checkpoint)
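For reference, the ~7B size follows roughly from these hyperparameters. Below is a minimal sketch of the estimate; the vocabulary size and sequence length are the GPT-2 defaults assumed here, not values taken from the script, and the exact total depends on GPTLMModel's implementation:

# Rough parameter-count estimate for the gpt2_7b config above.
# 12 * L * h^2 covers the attention + feed-forward weights per transformer block;
# vocab_size=50257 and seq_len=1024 are assumed GPT-2 defaults.
def approx_gpt_params(hidden_size=4096, num_layers=35, vocab_size=50257, seq_len=1024):
    block_params = 12 * num_layers * hidden_size ** 2      # attention + MLP weights
    embed_params = (vocab_size + seq_len) * hidden_size    # token + position embeddings
    return block_params + embed_params

print(f"{approx_gpt_params() / 1e9:.2f}B")  # ~7.26B, roughly consistent with the "7.5B" reported in the log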
The output below shows the error. At the moment the error occurred, host memory usage was already at 100%. The server has 128 GB of memory. I don't know why the error happens at the 3rd step.
ColossalAI/examples/language/gpt/gemini$ bash run_gemini.sh
+ export DISTPLAN=CAI_Gemini
+ DISTPLAN=CAI_Gemini
+ export CUDA_VISIBLE_DEVICES=0,1
+ CUDA_VISIBLE_DEVICES=0,1
+ export GPUNUM=2
+ GPUNUM=2
+ export TPDEGREE=1
+ TPDEGREE=1
+ export PLACEMENT=cpu
+ PLACEMENT=cpu
+ export USE_SHARD_INIT=False
+ USE_SHARD_INIT=False
+ export BATCH_SIZE=16
+ BATCH_SIZE=16
+ export MODEL_TYPE=gpt2_7b
+ MODEL_TYPE=gpt2_7b
+ export TRAIN_STEP=1000
+ TRAIN_STEP=1000
+ '[' False = True ']'
+ USE_SHARD_INIT=
+ mkdir -p gemini_logs
+ torchrun --standalone --nproc_per_node=2 ./train_gpt_demo.py --tp_degree=1 --model_type=gpt2_7b --batch_size=16 --placement=cpu --distplan=CAI_Gemini --train_step=1000
+ tee ./gemini_logs/gpt2_7b_CAI_Gemini_gpu_2_bs_16_tp_1_cpu.log
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
environmental variable OMP_NUM_THREADS is set to 40.
environmental variable OMP_NUM_THREADS is set to 40.
[03/17/23 20:00:13] INFO colossalai - colossalai - INFO: /data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
[03/17/23 20:00:13] INFO colossalai - colossalai - INFO: /data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
INFO colossalai - colossalai - INFO: process rank 1 is bound to device 1
[03/17/23 20:00:18] INFO colossalai - colossalai - INFO: /data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/colossalai/context/parallel_context.py:557 set_seed
[03/17/23 20:00:18] INFO colossalai - colossalai - INFO: /data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed
is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed
is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/colossalai/initialize.py:116 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 2, pipeline parallel size: 1, tensor parallel size: 1
INFO colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:211 main
INFO colossalai - colossalai - INFO: gpt2_7b, CAI_Gemini, batch size 16
[extension] OP colossalai._C.cpu_adam has been compileed ahead of time, skip building.
[extension] OP colossalai._C.fused_optim has been compileed ahead of time, skip building.
searching chunk configuration is completed in 0.89 s.
used number: 6922.10 MB, wasted number: 60.29 MB
total wasted percentage is 0.86%
[03/17/23 20:01:54] INFO colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:270 main
INFO colossalai - colossalai - INFO: After init optim, GPU memory usage: 35.02 MB, CPU memory usage: 43108.87 MB
INFO colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:285 main
INFO colossalai - colossalai - INFO: the size of testing model size is 7.5B.
INFO colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:286 main
INFO colossalai - colossalai - INFO: After init model, GPU memory usage: 35.02 MB, CPU memory usage: 43108.89 MB
[03/17/23 20:01:54] INFO colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:285 main
INFO colossalai - colossalai - INFO: the size of testing model size is 7.5B.
[03/17/23 20:02:02] INFO colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:308 train_step
INFO colossalai - colossalai - INFO: [1/1000] Forward GPU memory usage: 8367.72 MB, CPU memory usage: 44092.62 MB
[03/17/23 20:02:20] INFO colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:320 train_step
INFO colossalai - colossalai - INFO: [1/1000] Backward GPU memory usage: 1605.80 MB, CPU memory usage: 44096.31 MB
INFO colossalai - colossalai - INFO: /data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/colossalai/nn/optimizer/zero_optimizer.py:218 step
[03/17/23 20:02:20] INFO colossalai - colossalai - INFO: /data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/colossalai/nn/optimizer/zero_optimizer.py:218 step
INFO colossalai - colossalai - INFO: Found overflow. Skip step
INFO colossalai - colossalai - INFO: Found overflow. Skip step
INFO colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:326 train_step
INFO colossalai - colossalai - INFO: [1/1000] Optimizer step GPU memory usage: 1605.80 MB, CPU memory usage: 44096.32 MB
INFO colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:329 train_step
INFO colossalai - colossalai - INFO: [1/1000] Loss:11.617, Step time: 25.792s, TFLOPS: 37.932, FWD time: 7.712s, BWD time: 18.063s, OPTIM time: 0.017s
[03/17/23 20:02:28] INFO colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:308 train_step
INFO colossalai - colossalai - INFO: [2/1000] Forward GPU memory usage: 8367.72 MB, CPU memory usage: 44096.42 MB
[03/17/23 20:02:46] INFO colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:320 train_step
INFO colossalai - colossalai - INFO: [2/1000] Backward GPU memory usage: 1605.80 MB, CPU memory usage: 44096.64 MB
INFO colossalai - colossalai - INFO: /data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/colossalai/nn/optimizer/zero_optimizer.py:218 step
INFO colossalai - colossalai - INFO: Found overflow. Skip step
[03/17/23 20:02:46] INFO colossalai - colossalai - INFO: /data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/colossalai/nn/optimizer/zero_optimizer.py:218 step
INFO colossalai - colossalai - INFO: Found overflow. Skip step
INFO colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:326 train_step
INFO colossalai - colossalai - INFO: [2/1000] Optimizer step GPU memory usage: 1605.80 MB, CPU memory usage: 44096.93 MB
INFO colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:329 train_step
INFO colossalai - colossalai - INFO: [2/1000] Loss:11.633, Step time: 25.643s, TFLOPS: 38.153, FWD time: 7.591s, BWD time: 18.030s, OPTIM time: 0.022s
[03/17/23 20:02:54] INFO colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:308 train_step
INFO colossalai - colossalai - INFO: [3/1000] Forward GPU memory usage: 8367.72 MB, CPU memory usage: 44096.98 MB
[03/17/23 20:03:12] INFO colossalai - colossalai - INFO: /data/zjt/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py:320 train_step
INFO colossalai - colossalai - INFO: [3/1000] Backward GPU memory usage: 1605.80 MB, CPU memory usage: 44097.03 MB
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2410413 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 2410412) of binary: /data/zjt/anaconda3/envs/cai/bin/python
Traceback (most recent call last):
File "/data/zjt/anaconda3/envs/cai/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/zjt/anaconda3/envs/cai/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
./train_gpt_demo.py FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-17_20:04:46
host : code
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 2410412)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 2410412
========================================================
Environment
pytorch==1.12.0 cuda==11.6 python==3.9
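Since rank 0 exits with -9 (SIGKILL) right when host memory reaches 100%, it looks like the kernel OOM killer may have terminated the process (dmesg usually records an "Out of memory: Killed process <pid>" entry if so). A small sketch that could be dropped into train_gpt_demo.py to log host memory per step and confirm the growth toward the 128 GB limit; psutil is an extra dependency assumed here, not part of the original example:

# Hypothetical helper: logs this process's RSS and system-wide memory usage.
import os
import psutil

def log_host_memory(step):
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    sys_used = psutil.virtual_memory().percent
    print(f"[step {step}] RSS: {rss_gb:.1f} GB, system memory used: {sys_used:.1f}%")

Calling this at the end of each train_step would show whether CPU memory keeps climbing between steps before the SIGKILL.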