ColossalAI
[BUG]: colossalai gets stuck on GPU1,2 but runs fine on GPU0,1
🐛 Describe the bug
Reproduction code is on GitHub.
Expected
Runs as expected with GPU0 and GPU1 (connected via NVLink):
$ CUDA_VISIBLE_DEVICES=0,1 colossalai run --nproc_per_node 2 colossal.py
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/cuda_native/mha/flash_attn_2.py:28: UserWarning: please install flash_attn from https://github.com/HazyResearch/flash-attention
warnings.warn("please install flash_attn from https://github.com/HazyResearch/flash-attention")
/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/cuda_native/mha/mem_eff_attn.py:15: UserWarning: please install xformers from https://github.com/facebookresearch/xformers
warnings.warn("please install xformers from https://github.com/facebookresearch/xformers")
/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/initialize.py:48: UserWarning: `config` is deprecated and will be removed soon.
warnings.warn("`config` is deprecated and will be removed soon.")
/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/cuda_native/mha/flash_attn_2.py:28: UserWarning: please install flash_attn from https://github.com/HazyResearch/flash-attention
warnings.warn("please install flash_attn from https://github.com/HazyResearch/flash-attention")
/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/cuda_native/mha/mem_eff_attn.py:15: UserWarning: please install xformers from https://github.com/facebookresearch/xformers
warnings.warn("please install xformers from https://github.com/facebookresearch/xformers")
[12/11/23 17:38:00] INFO colossalai - colossalai - INFO: /home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/initialize.py:63 launch
                    INFO colossalai - colossalai - INFO: Distributed environment is initialized, world size: 2
Epoch [1/80]: 100%|██████████| 50/50 [00:26<00:00, 1.91it/s, loss=-]
Epoch [2/80]: 96%|█████████▌| 48/50 [00:21<00:00, 2.27it/s, loss=-]
Mon Dec 11 17:38:22 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A6000 On | 00000000:4F:00.0 Off | Off |
| 30% 43C P2 138W / 300W| 9751MiB / 49140MiB | 65% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A6000 On | 00000000:52:00.0 Off | Off |
| 30% 45C P2 78W / 300W| 9751MiB / 49140MiB | 34% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA RTX A6000 On | 00000000:56:00.0 Off | Off |
| 30% 29C P8 26W / 300W| 1MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA RTX A6000 On | 00000000:57:00.0 Off | Off |
| 30% 29C P8 19W / 300W| 1MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA RTX A6000 On | 00000000:CE:00.0 Off | Off |
| 30% 28C P8 26W / 300W| 1MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA RTX A6000 On | 00000000:D1:00.0 Off | Off |
| 30% 29C P8 24W / 300W| 1MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 92018 C ...niconda3/envs/colossalai/bin/python 9748MiB |
| 1 N/A N/A 92019 C ...niconda3/envs/colossalai/bin/python 9748MiB |
+---------------------------------------------------------------------------------------+
Unexpected
Gets stuck when using GPU1 and GPU2 (no NVLink between them):
$ CUDA_VISIBLE_DEVICES=1,2 colossalai run --nproc_per_node 2 colossal.py
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/cuda_native/mha/flash_attn_2.py:28: UserWarning: please install flash_attn from https://github.com/HazyResearch/flash-attention
warnings.warn("please install flash_attn from https://github.com/HazyResearch/flash-attention")
/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/cuda_native/mha/mem_eff_attn.py:15: UserWarning: please install xformers from https://github.com/facebookresearch/xformers
warnings.warn("please install xformers from https://github.com/facebookresearch/xformers")
/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/cuda_native/mha/flash_attn_2.py:28: UserWarning: please install flash_attn from https://github.com/HazyResearch/flash-attention
warnings.warn("please install flash_attn from https://github.com/HazyResearch/flash-attention")
/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/cuda_native/mha/mem_eff_attn.py:15: UserWarning: please install xformers from https://github.com/facebookresearch/xformers
warnings.warn("please install xformers from https://github.com/facebookresearch/xformers")
/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/initialize.py:48: UserWarning: `config` is deprecated and will be removed soon.
warnings.warn("`config` is deprecated and will be removed soon.")
[12/11/23 17:39:34] INFO colossalai - colossalai - INFO: /home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/initialize.py:63 launch
                    INFO colossalai - colossalai - INFO: Distributed environment is initialized, world size: 2
Mon Dec 11 17:39:49 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A6000 On | 00000000:4F:00.0 Off | Off |
| 30% 31C P8 25W / 300W| 1MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A6000 On | 00000000:52:00.0 Off | Off |
| 30% 38C P2 85W / 300W| 1089MiB / 49140MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA RTX A6000 On | 00000000:56:00.0 Off | Off |
| 30% 34C P2 97W / 300W| 1089MiB / 49140MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA RTX A6000 On | 00000000:57:00.0 Off | Off |
| 30% 29C P8 19W / 300W| 1MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA RTX A6000 On | 00000000:CE:00.0 Off | Off |
| 30% 28C P8 26W / 300W| 1MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA RTX A6000 On | 00000000:D1:00.0 Off | Off |
| 30% 29C P8 24W / 300W| 1MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 1 N/A N/A 92238 C ...niconda3/envs/colossalai/bin/python 1086MiB |
| 2 N/A N/A 92239 C ...niconda3/envs/colossalai/bin/python 1086MiB |
+---------------------------------------------------------------------------------------+
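To narrow down whether the hang is in ColossalAI itself or in the NCCL transport underneath it, a minimal plain-PyTorch sketch like the one below could be run on the same GPU pair (nccl_check.py is a hypothetical script name, not from the original repro; it assumes pytorch==1.13.1 as listed in the environment below). If this also hangs on GPU1,2 but completes on GPU0,1, the problem is below ColossalAI:

# nccl_check.py -- hypothetical minimal repro, not part of the original issue.
# Run with: CUDA_VISIBLE_DEVICES=1,2 torchrun --nproc_per_node 2 nccl_check.py
import os

import torch
import torch.distributed as dist


def main():
    rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each worker
    torch.cuda.set_device(rank)
    dist.init_process_group(backend="nccl")  # same backend ColossalAI uses

    # A single all_reduce exercises the same GPU<->GPU path that gradient
    # synchronization uses during training.
    x = torch.ones(1024, device="cuda")
    dist.all_reduce(x)
    torch.cuda.synchronize()
    print(f"rank {rank}: all_reduce ok, x[0] = {x[0].item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()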
GPU topology
$ nvidia-smi topo -m
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  NIC0  NIC1  CPU Affinity  NUMA Affinity
GPU0   X    NV4   PXB   PXB   SYS   SYS   NODE  NODE  0-11,24-35    0
GPU1  NV4    X    PXB   PXB   SYS   SYS   NODE  NODE  0-11,24-35    0
GPU2  PXB   PXB    X    NV4   SYS   SYS   NODE  NODE  0-11,24-35    0
GPU3  PXB   PXB   NV4    X    SYS   SYS   NODE  NODE  0-11,24-35    0
GPU4  SYS   SYS   SYS   SYS    X    NV4   SYS   SYS   12-23,36-47   1
GPU5  SYS   SYS   SYS   SYS   NV4    X    SYS   SYS   12-23,36-47   1
NIC0  NODE  NODE  NODE  NODE  SYS   SYS    X    PIX
NIC1  NODE  NODE  NODE  NODE  SYS   SYS   PIX    X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
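From the matrix above, GPU0-GPU1 (and GPU2-GPU3) are NVLink pairs (NV4), while GPU1 and GPU2 only reach each other through PCIe bridges (PXB). The hang therefore correlates with NCCL taking the PCIe peer-to-peer path. A workaround sketch, assuming P2P is the problem (NCCL_P2P_DISABLE and NCCL_DEBUG are standard NCCL environment variables, read once when the process group is created):

# Workaround sketch, not a confirmed fix: force NCCL to fall back from PCIe
# peer-to-peer to staging through host memory, and log which transport it
# actually picks. These must be set before colossalai's launch /
# init_process_group runs, e.g. at the very top of colossal.py.
import os

os.environ["NCCL_DEBUG"] = "INFO"      # print NCCL topology/transport decisions
os.environ["NCCL_P2P_DISABLE"] = "1"   # disable P2P (the suspect path here)

Equivalently, export them on the launch command: NCCL_P2P_DISABLE=1 NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=1,2 colossalai run --nproc_per_node 2 colossal.py. If that run completes, the PCIe P2P path is the likely culprit; on multi-bridge topologies like this one, that is sometimes a host-side IOMMU/ACS setting rather than a ColossalAI bug.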
Environment
nvidia-driver: 530.30.02
python=3.10.13 pytorch==1.13.1 pytorch-cuda=11.7 cuda-toolkit=11.7.1 colossalai=0.3.4
Hi, can you identify where it is getting stuck? I'm unable to reproduce this issue.
When I press Ctrl+C it prints this:
^CWARNING:torch.distributed.elastic.agent.server.api:Received 2 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 423964 closing signal SIGINT
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 423965 closing signal SIGINT
Aborted!
^CException ignored in atexit callback: <function _exit_function at 0x7f4687d7bf40>
Traceback (most recent call last):
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/multiprocessing/util.py", line 357, in _exit_function
Process Process-1:
    p.join()
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/multiprocessing/popen_fork.py", line 43, in wait
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 423964 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 423965 closing signal SIGTERM
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt:
Traceback (most recent call last):
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/invoke/runners.py", line 466, in _finish
    self.wait()
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/invoke/runners.py", line 1003, in wait
    time.sleep(self.input_sleep)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/invoke/runners.py", line 473, in _finish
    self.send_interrupt(e)
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/invoke/runners.py", line 1154, in send_interrupt
    self.write_proc_stdin("\x03")
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/invoke/runners.py", line 1017, in write_proc_stdin
    self._write_proc_stdin(data.encode(self.encoding))
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/invoke/runners.py", line 1282, in _write_proc_stdin
    fd = self.process.stdin.fileno()
ValueError: I/O operation on closed file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/cli/launcher/multinode_runner.py", line 50, in run_on_host
    fab_conn.local(cmds, hide=False)
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/fabric/connection.py", line 870, in local
    return super().run(*args, **kwargs)
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/invoke/context.py", line 104, in run
    return self._run(runner, command, **kwargs)
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/invoke/context.py", line 113, in _run
    return runner.run(command, **kwargs)
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/invoke/runners.py", line 395, in run
    return self._run_body(command, **kwargs)
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/invoke/runners.py", line 451, in _run_body
    return self.make_promise() if self._asynchronous else self._finish()
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/invoke/runners.py", line 489, in _finish
    thread.join(self._thread_join_timeout(target))
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/threading.py", line 1096, in join
    self._wait_for_tstate_lock()
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
KeyboardInterrupt
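That trace only shows the launcher tearing itself down; it does not show where the two workers are blocked. One way to answer the question above is to dump each rank's Python stack while it is hung. A sketch, assuming colossal.py can be edited (faulthandler is in the standard library):

# Hypothetical addition near the top of colossal.py: on SIGUSR1, dump the
# Python stack of every thread in the process. While the run is hung,
# `kill -USR1 <pid>` against each worker PID (92238 / 92239 in the
# nvidia-smi output above; they change per run) then shows exactly which
# call each rank is blocked in.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)

Alternatively, py-spy dump --pid <worker pid> captures the same stacks without editing the script. Either way, the frame each rank is stuck in (most likely a collective such as all_reduce inside NCCL) would be useful to paste here.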