[BUG]: Program is blocked when training GPT with PP=4 on 4 GPUs in titans.
🐛 Describe the bug
Hi, training works normally with PP=2 on 2 GPUs (see my related question https://github.com/hpcaitech/ColossalAI/issues/2535). However, the program appears to be blocked when I run with PP=4 on 4 GPUs, and GPU utilization stays at 100%.
[02/24/23 15:07:44] INFO colossalai - colossalai - INFO: /root/pkgs/py37/lib/python3.7/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 3 is bound to device 3
[02/24/23 15:07:44] INFO colossalai - colossalai - INFO: /root/py37/lib/python3.7/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[02/24/23 15:07:44] INFO colossalai - colossalai - INFO: /root/py37/lib/python3.7/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 1 is bound to device 1
[02/24/23 15:07:44] INFO colossalai - colossalai - INFO: /root/py37/lib/python3.7/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 2 is bound to device 2
[02/24/23 15:07:49] INFO colossalai - colossalai - INFO: /root/py37/lib/python3.7/site-packages/colossalai/context/parallel_context.py:557 set_seed
[02/24/23 15:07:49] INFO colossalai - colossalai - INFO: /root/py37/lib/python3.7/site-packages/colossalai/context/parallel_context.py:557 set_seed
[02/24/23 15:07:49] INFO colossalai - colossalai - INFO: /root/py37/lib/python3.7/site-packages/colossalai/context/parallel_context.py:557 set_seed
[02/24/23 15:07:49] INFO colossalai - colossalai - INFO: /root/py37/lib/python3.7/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 2048,the default parallel
seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: initialized seed on rank 2, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 3072,the default parallel
seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel
seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: initialized seed on rank 3, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 4096,the default parallel
seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /root/py37/lib/python3.7/site-packages/colossalai/initialize.py:120 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 4, tensor parallel size: 1
INFO colossalai - colossalai - INFO: train_gpt.py:46 main
INFO colossalai - colossalai - INFO: Build data loader from path None
INFO colossalai - colossalai - INFO: train_gpt.py:56 main
INFO colossalai - colossalai - INFO: Build model
INFO colossalai - colossalai - INFO: /root/gpt/titans/model/pipeline_gpt1d.py:242 _build_generic_gpt_pipeline_1d
INFO colossalai - colossalai - INFO: Rank0 build layer 0-3, 3/12 layers
INFO colossalai - colossalai - INFO: /root/gpt/titans/model/pipeline_gpt1d.py:242 _build_generic_gpt_pipeline_1d
INFO colossalai - colossalai - INFO: /root/gpt/titans/model/pipeline_gpt1d.py:242 _build_generic_gpt_pipeline_1d
INFO colossalai - colossalai - INFO: Rank2 build layer 6-9, 3/12 layers
INFO colossalai - colossalai - INFO: Rank1 build layer 3-6, 3/12 layers
INFO colossalai - colossalai - INFO: /root/gpt/titans/model/pipeline_gpt1d.py:242 _build_generic_gpt_pipeline_1d
INFO colossalai - colossalai - INFO: Rank3 build layer 9-12, 3/12 layers
INFO colossalai - colossalai - INFO: /root/gpt/titans/model/pipeline_gpt1d.py:259 _build_generic_gpt_pipeline_1d
INFO colossalai - colossalai - INFO: Rank2/2 model size = 0.042527232 GB
INFO colossalai - colossalai - INFO: /root/gpt/titans/model/pipeline_gpt1d.py:259 _build_generic_gpt_pipeline_1d
INFO colossalai - colossalai - INFO: Rank1/1 model size = 0.042527232 GB
[02/24/23 15:07:50] INFO colossalai - colossalai - INFO: /root/gpt/titans/model/pipeline_gpt1d.py:259 _build_generic_gpt_pipeline_1d
INFO colossalai - colossalai - INFO: Rank0/0 model size = 0.12136704 GB
[02/24/23 15:07:50] INFO colossalai - colossalai - INFO: /root/titans/model/pipeline_gpt1d.py:259 _build_generic_gpt_pipeline_1d
INFO colossalai - colossalai - INFO: Rank3/3 model size = 0.119797248 GB
INFO colossalai - colossalai - INFO: train_gpt.py:84 main
INFO colossalai - colossalai - INFO: Build optimizer
OP colossalai._C.cpu_adam already exists, skip building.
Time to load cpu_adam op: 0.001405954360961914 seconds
OP colossalai._C.fused_optim already exists, skip building.
Time to load fused_optim op: 5.14984130859375e-05 seconds
[02/24/23 15:07:51] INFO colossalai - colossalai - INFO: /root/py37/lib/python3.7/site-packages/colossalai/initialize.py:269 initialize
INFO colossalai - colossalai - INFO:
========== Your Config ========
{'BATCH_SIZE': 8,
'GPT2_small_pipeline_hybrid': <function GPT2_small_pipeline_hybrid at 0x7f15d77b88c0>,
'HIDDEN_SIZE': 768,
'NUM_EPOCHS': 10,
'NUM_MICRO_BATCHES': 4,
'SEQ_LEN': 1024,
'TENSOR_SHAPE': (2, 1024, 768),
'model': {'checkpoint': True, 'num_chunks': 1},
'optimizer': {'lr': 1.5e-05, 'weight_decay': 0.01},
'parallel': {'pipeline': 4, 'tensor': {'mode': '1d', 'size': 1}},
'zero': {'model_config': {'shard_strategy': <colossalai.zero.shard_utils.tensor_shard_strategy.TensorShardStrategy object at 0x7f15d779f050>,
'tensor_placement_policy': 'cuda'},
'optimizer_config': {'initial_scale': 32}}}
================================
INFO colossalai - colossalai - INFO: /root/py37/lib/python3.7/site-packages/colossalai/initialize.py:277 initialize
INFO colossalai - colossalai - INFO: cuDNN benchmark = False, deterministic = False
[02/24/23 15:07:51] INFO colossalai - convert_to_zero_v2 - INFO: /root/py37/lib/python3.7/site-packages/colossalai/zero/__init__.py:29 convert_to_zero_v2
INFO colossalai - convert_to_zero_v2 - INFO: optimizer_config is {'initial_scale': 32}
INFO colossalai - convert_to_zero_v2 - INFO: /root/py37/lib/python3.7/site-packages/colossalai/zero/__init__.py:32 convert_to_zero_v2
INFO colossalai - convert_to_zero_v2 - INFO: model_config is {'tensor_placement_policy': 'cuda', 'shard_strategy':
<colossalai.zero.shard_utils.tensor_shard_strategy.TensorShardStrategy object at 0x7f15d779f050>}
INFO colossalai - colossalai - INFO: /root/py37/lib/python3.7/site-packages/colossalai/initialize.py:294 initialize
INFO colossalai - colossalai - INFO: Initializing ZeRO model and optimizer finished!
WARNING colossalai - colossalai - WARNING: /root/py37/lib/python3.7/site-packages/colossalai/initialize.py:318 initialize
WARNING colossalai - colossalai - WARNING: The parameters of models is not automatically synchronized.
Please make sure that all parameters are the same in data parallel group.
[02/24/23 15:07:51] INFO colossalai - colossalai - INFO: /root/py37/lib/python3.7/site-packages/colossalai/utils/memory.py:91 report_memory_usage
INFO colossalai - colossalai - INFO: /root/py37/lib/python3.7/site-packages/colossalai/initialize.py:362 initialize
INFO colossalai - colossalai - INFO: Before-train: GPU: allocated 344.25 MB, max allocated 344.25 MB, cached: 432.0 MB, max cached: 432.0 MB
[02/24/23 15:07:51] INFO colossalai - colossalai - INFO: /root/py37/lib/python3.7/site-packages/colossalai/utils/memory.py:91 report_memory_usage
INFO colossalai - colossalai - INFO: Training with zero is detected, ZeROGradientHandler is automatically added even though not specified in the configuration
INFO colossalai - colossalai - INFO: Before-train: GPU: allocated 122.55 MB, max allocated 122.55 MB, cached: 142.0 MB, max cached: 142.0 MB
[Epoch 0 / Train]: 0%| | 0/1280 [00:00<?, ?it/s][02/24/23 15:07:51] INFO colossalai - colossalai - INFO: /root/py37/lib/python3.7/site-packages/colossalai/utils/memory.py:91 report_memory_usage
INFO colossalai - colossalai - INFO: /root/py37/lib/python3.7/site-packages/colossalai/initialize.py:404 initialize
INFO colossalai - colossalai - INFO: Before-train: GPU: allocated 122.55 MB, max allocated 122.55 MB, cached: 142.0 MB, max cached: 142.0 MB
INFO colossalai - colossalai - INFO: pipeline_shared_module is detected, PipelineSharedModuleGradientHandler is automatically added even though not specified in the
configuration
INFO colossalai - colossalai - INFO: train_gpt.py:100 main
INFO colossalai - colossalai - INFO: Init done, global batch size = 8
INFO colossalai - colossalai - INFO: /root/py37/lib/python3.7/site-packages/colossalai/trainer/_trainer.py:306 fit
INFO colossalai - colossalai - INFO: Using LossHook for training, priority = 0
INFO colossalai - colossalai - INFO: /root/py37/lib/python3.7/site-packages/colossalai/trainer/_trainer.py:306 fit
INFO colossalai - colossalai - INFO: Using LRSchedulerHook for training, priority = 1
INFO colossalai - colossalai - INFO: /root/py37/lib/python3.7/site-packages/colossalai/trainer/_trainer.py:306 fit
INFO colossalai - colossalai - INFO: Using LogMetricByEpochHook for training, priority = 10
INFO colossalai - colossalai - INFO: /root/py37/lib/python3.7/site-packages/colossalai/trainer/_trainer.py:306 fit
INFO colossalai - colossalai - INFO: Using ThroughputHook for training, priority = 10
INFO colossalai - colossalai - INFO: /root/py37/lib/python3.7/site-packages/colossalai/trainer/_trainer.py:306 fit
INFO colossalai - colossalai - INFO: Using LogMetricByStepHook for training, priority = 10
INFO colossalai - colossalai - INFO: /root/py37/lib/python3.7/site-packages/colossalai/trainer/_trainer.py:306 fit
INFO colossalai - colossalai - INFO: Using LogMemoryByEpochHook for training, priority = 10
INFO colossalai - colossalai - INFO: /root/py37/lib/python3.7/site-packages/colossalai/trainer/_trainer.py:306 fit
INFO colossalai - colossalai - INFO: Using SaveCheckpointHook for training, priority = 10
INFO colossalai - colossalai - INFO: /root/py37/lib/python3.7/site-packages/colossalai/trainer/_trainer.py:308 fit
INFO colossalai - colossalai - INFO: Lower value means higher priority for calling hook function
INFO colossalai - colossalai - INFO: /root/py37/lib/python3.7/site-packages/colossalai/utils/memory.py:91 report_memory_usage
INFO colossalai - colossalai - INFO: Before-train: GPU: allocated 347.86 MB, max allocated 347.86 MB, cached: 372.0 MB, max cached: 372.0 MB
^C
Aborted!
WARNING:torch.distributed.elastic.agent.server.api:Received 2 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 148376 closing signal SIGINT
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 148377 closing signal SIGINT
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 148378 closing signal SIGINT
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 148379 closing signal SIGINT
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/root/py37/lib/python3.7/multiprocessing/popen_fork.py", line 28, in poll
Process Process-1:
pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 148376 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 148377 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 148378 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 148379 closing signal SIGTERM
Traceback (most recent call last):
File "/root/py37/lib/python3.7/site-packages/invoke/runners.py", line 447, in _finish
self.wait()
File "/root/py37/lib/python3.7/site-packages/invoke/runners.py", line 964, in wait
time.sleep(self.input_sleep)
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/py37/lib/python3.7/site-packages/invoke/runners.py", line 454, in _finish
self.send_interrupt(e)
File "/root/py37/lib/python3.7/site-packages/invoke/runners.py", line 1115, in send_interrupt
self.write_proc_stdin("\x03")
File "/root/py37/lib/python3.7/site-packages/invoke/runners.py", line 978, in write_proc_stdin
self._write_proc_stdin(data.encode(self.encoding))
File "/root/py37/lib/python3.7/site-packages/invoke/runners.py", line 1234, in _write_proc_stdin
fd = self.parent_fd if self.using_pty else self.process.stdin.fileno()
ValueError: I/O operation on closed file
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/py37/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/root/py37/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/root/py37/lib/python3.7/site-packages/colossalai/cli/launcher/multinode_runner.py", line 45, in run_on_host
fab_conn.local(cmds, hide=False)
File "/root/py37/lib/python3.7/site-packages/fabric/connection.py", line 846, in local
return super().run(*args, **kwargs)
File "/root/py37/lib/python3.7/site-packages/invoke/context.py", line 91, in run
return self._run(runner, command, **kwargs)
File "/root/py37/lib/python3.7/site-packages/invoke/context.py", line 98, in _run
return runner.run(command, **kwargs)
File "/root/py37/lib/python3.7/site-packages/invoke/runners.py", line 376, in run
return self._run_body(command, **kwargs)
File "/root/py37/lib/python3.7/site-packages/invoke/runners.py", line 432, in _run_body
return self.make_promise() if self._asynchronous else self._finish()
File "/root/py37/lib/python3.7/site-packages/invoke/runners.py", line 470, in _finish
thread.join(self._thread_join_timeout(target))
File "/root/py37/lib/python3.7/threading.py", line 1044, in join
self._wait_for_tstate_lock()
File "/root/py37/lib/python3.7/threading.py", line 1060, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
KeyboardInterrupt
Environment
No response
Hi @ErenChan, for your case of PP=4 on 4 GPUs, zero and tensor parallelism are no longer necessary, since your model is fully pipelined.
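For illustration only, here is a minimal sketch of what the config might look like with the `zero` section dropped and tensor parallel size left at 1, so only pipeline parallelism is used. The values are copied from the config dump in the log above; the filename and the exact trimming are assumptions, not a tested recipe:

```python
# hypothetical file: gpt2_pp4_config.py -- a sketch, not a verified configuration
# Hyperparameters mirror the "Your Config" dump above; the `zero` block is removed
# and tensor parallelism is kept at size 1, leaving pure pipeline parallelism (size 4).

BATCH_SIZE = 8
NUM_EPOCHS = 10
SEQ_LEN = 1024
NUM_MICRO_BATCHES = 4
HIDDEN_SIZE = 768

# per-microbatch tensor shape: (BATCH_SIZE // NUM_MICRO_BATCHES, SEQ_LEN, HIDDEN_SIZE)
TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, SEQ_LEN, HIDDEN_SIZE)

optimizer = dict(lr=1.5e-5, weight_decay=1e-2)

model = dict(checkpoint=True, num_chunks=1)

# pure pipeline parallelism across 4 GPUs; no `zero` section
parallel = dict(
    pipeline=4,
    tensor=dict(mode='1d', size=1),
)
```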
This issue was closed due to inactivity. Thanks.