
[BUG]: colossalai gets stuck with GPU1,2 but runs fine with GPU0,1

banjiaojuhao opened this issue 2 years ago • 2 comments

🐛 Describe the bug

code at github

Expected

Runs as expected with GPU0,1 (connected via NVLink):

$ CUDA_VISIBLE_DEVICES=0,1 colossalai run --nproc_per_node 2 colossal.py
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/cuda_native/mha/flash_attn_2.py:28: UserWarning: please install flash_attn from https://github.com/HazyResearch/flash-attention
  warnings.warn("please install flash_attn from https://github.com/HazyResearch/flash-attention")
/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/cuda_native/mha/mem_eff_attn.py:15: UserWarning: please install xformers from https://github.com/facebookresearch/xformers
  warnings.warn("please install xformers from https://github.com/facebookresearch/xformers")
/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/initialize.py:48: UserWarning: `config` is deprecated and will be removed soon.
  warnings.warn("`config` is deprecated and will be removed soon.")
/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/cuda_native/mha/flash_attn_2.py:28: UserWarning: please install flash_attn from https://github.com/HazyResearch/flash-attention
  warnings.warn("please install flash_attn from https://github.com/HazyResearch/flash-attention")
/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/cuda_native/mha/mem_eff_attn.py:15: UserWarning: please install xformers from https://github.com/facebookresearch/xformers
  warnings.warn("please install xformers from https://github.com/facebookresearch/xformers")
[12/11/23 17:38:00] INFO     colossalai - colossalai - INFO:
                             /home/ices/miniconda3/envs/colossalai/lib/python3.1
                             0/site-packages/colossalai/initialize.py:63 launch
                    INFO     colossalai - colossalai - INFO: Distributed
                             environment is initialized, world size: 2
Epoch [1/80]: 100%|██████████| 50/50 [00:26<00:00,  1.91it/s, loss=-]
Epoch [2/80]:  96%|█████████▌| 48/50 [00:21<00:00,  2.27it/s, loss=-]
Mon Dec 11 17:38:22 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000                On | 00000000:4F:00.0 Off |                  Off |
| 30%   43C    P2              138W / 300W|   9751MiB / 49140MiB |     65%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000                On | 00000000:52:00.0 Off |                  Off |
| 30%   45C    P2               78W / 300W|   9751MiB / 49140MiB |     34%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000                On | 00000000:56:00.0 Off |                  Off |
| 30%   29C    P8               26W / 300W|      1MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000                On | 00000000:57:00.0 Off |                  Off |
| 30%   29C    P8               19W / 300W|      1MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA RTX A6000                On | 00000000:CE:00.0 Off |                  Off |
| 30%   28C    P8               26W / 300W|      1MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA RTX A6000                On | 00000000:D1:00.0 Off |                  Off |
| 30%   29C    P8               24W / 300W|      1MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     92018      C   ...niconda3/envs/colossalai/bin/python     9748MiB |
|    1   N/A  N/A     92019      C   ...niconda3/envs/colossalai/bin/python     9748MiB |
+---------------------------------------------------------------------------------------+

Unexpected

Gets stuck when using GPU1,2 (no NVLink between them):

$ CUDA_VISIBLE_DEVICES=1,2 colossalai run --nproc_per_node 2 colossal.py
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/cuda_native/mha/flash_attn_2.py:28: UserWarning: please install flash_attn from https://github.com/HazyResearch/flash-attention
  warnings.warn("please install flash_attn from https://github.com/HazyResearch/flash-attention")
/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/cuda_native/mha/mem_eff_attn.py:15: UserWarning: please install xformers from https://github.com/facebookresearch/xformers
  warnings.warn("please install xformers from https://github.com/facebookresearch/xformers")
/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/cuda_native/mha/flash_attn_2.py:28: UserWarning: please install flash_attn from https://github.com/HazyResearch/flash-attention
  warnings.warn("please install flash_attn from https://github.com/HazyResearch/flash-attention")
/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/cuda_native/mha/mem_eff_attn.py:15: UserWarning: please install xformers from https://github.com/facebookresearch/xformers
  warnings.warn("please install xformers from https://github.com/facebookresearch/xformers")
/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/initialize.py:48: UserWarning: `config` is deprecated and will be removed soon.
  warnings.warn("`config` is deprecated and will be removed soon.")
[12/11/23 17:39:34] INFO     colossalai - colossalai - INFO:
                             /home/ices/miniconda3/envs/colossalai/lib/python3.1
                             0/site-packages/colossalai/initialize.py:63 launch
                    INFO     colossalai - colossalai - INFO: Distributed
                             environment is initialized, world size: 2
Mon Dec 11 17:39:49 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000                On | 00000000:4F:00.0 Off |                  Off |
| 30%   31C    P8               25W / 300W|      1MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000                On | 00000000:52:00.0 Off |                  Off |
| 30%   38C    P2               85W / 300W|   1089MiB / 49140MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000                On | 00000000:56:00.0 Off |                  Off |
| 30%   34C    P2               97W / 300W|   1089MiB / 49140MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000                On | 00000000:57:00.0 Off |                  Off |
| 30%   29C    P8               19W / 300W|      1MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA RTX A6000                On | 00000000:CE:00.0 Off |                  Off |
| 30%   28C    P8               26W / 300W|      1MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA RTX A6000                On | 00000000:D1:00.0 Off |                  Off |
| 30%   29C    P8               24W / 300W|      1MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    1   N/A  N/A     92238      C   ...niconda3/envs/colossalai/bin/python     1086MiB |
|    2   N/A  N/A     92239      C   ...niconda3/envs/colossalai/bin/python     1086MiB |
+---------------------------------------------------------------------------------------+

GPU topology

$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    NIC0    NIC1    CPU Affinity    NUMA Affinity
GPU0     X      NV4     PXB     PXB     SYS     SYS     NODE    NODE    0-11,24-35      0
GPU1    NV4      X      PXB     PXB     SYS     SYS     NODE    NODE    0-11,24-35      0
GPU2    PXB     PXB      X      NV4     SYS     SYS     NODE    NODE    0-11,24-35      0
GPU3    PXB     PXB     NV4      X      SYS     SYS     NODE    NODE    0-11,24-35      0
GPU4    SYS     SYS     SYS     SYS      X      NV4     SYS     SYS     12-23,36-47     1
GPU5    SYS     SYS     SYS     SYS     NV4      X      SYS     SYS     12-23,36-47     1
NIC0    NODE    NODE    NODE    NODE    SYS     SYS      X      PIX
NIC1    NODE    NODE    NODE    NODE    SYS     SYS     PIX      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1

Environment

nvidia-driver: 530.30.02

python=3.10.13 pytorch==1.13.1 pytorch-cuda=11.7 cuda-toolkit=11.7.1 colossalai=0.3.4

banjiaojuhao avatar Dec 11 '23 09:12 banjiaojuhao

Hi, can you identify where it is getting stuck? I'm unable to reproduce this issue.

flybird11111 avatar Dec 12 '23 02:12 flybird11111
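One way to answer "where is it stuck" without killing the processes is to dump the Python stacks of each hung rank on demand. A minimal, hypothetical instrumentation for the training script (the `SIGUSR1` choice is an assumption; nothing here is installed by ColossalAI itself):

```python
import faulthandler
import signal
import sys

# Sketch: after this call, `kill -USR1 <pid>` makes the process print the
# Python stack of every thread to stderr without terminating it. This
# usually shows whether a rank is blocked inside a collective call
# (e.g. a barrier or all-reduce) or somewhere else entirely.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)
```

Alternatively, `py-spy dump --pid <PID>` on each worker gives the same information without modifying the script.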

When I press Ctrl+C it prints this:

^CWARNING:torch.distributed.elastic.agent.server.api:Received 2 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 423964 closing signal SIGINT
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 423965 closing signal SIGINT

Aborted!
^CException ignored in atexit callback: <function _exit_function at 0x7f4687d7bf40>
Traceback (most recent call last):
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/multiprocessing/util.py", line 357, in _exit_function
Process Process-1:
    p.join()
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/multiprocessing/popen_fork.py", line 43, in wait
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 423964 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 423965 closing signal SIGTERM
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt:
Traceback (most recent call last):
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/invoke/runners.py", line 466, in _finish
    self.wait()
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/invoke/runners.py", line 1003, in wait
    time.sleep(self.input_sleep)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/invoke/runners.py", line 473, in _finish
    self.send_interrupt(e)
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/invoke/runners.py", line 1154, in send_interrupt
    self.write_proc_stdin("\x03")
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/invoke/runners.py", line 1017, in write_proc_stdin
    self._write_proc_stdin(data.encode(self.encoding))
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/invoke/runners.py", line 1282, in _write_proc_stdin
    fd = self.process.stdin.fileno()
ValueError: I/O operation on closed file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/cli/launcher/multinode_runner.py", line 50, in run_on_host
    fab_conn.local(cmds, hide=False)
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/fabric/connection.py", line 870, in local
    return super().run(*args, **kwargs)
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/invoke/context.py", line 104, in run
    return self._run(runner, command, **kwargs)
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/invoke/context.py", line 113, in _run
    return runner.run(command, **kwargs)
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/invoke/runners.py", line 395, in run
    return self._run_body(command, **kwargs)
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/invoke/runners.py", line 451, in _run_body
    return self.make_promise() if self._asynchronous else self._finish()
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/site-packages/invoke/runners.py", line 489, in _finish
    thread.join(self._thread_join_timeout(target))
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/threading.py", line 1096, in join
    self._wait_for_tstate_lock()
  File "/home/ices/miniconda3/envs/colossalai/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
KeyboardInterrupt

banjiaojuhao avatar Dec 12 '23 07:12 banjiaojuhao
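Given the topology above (GPU1↔GPU2 traverse PCIe bridges only, unlike the NVLink-connected working pair), a common next step is to check whether NCCL's peer-to-peer transport is the culprit. A hedged sketch of that diagnosis, assuming the hang is inside NCCL initialization or the first collective:

```shell
# Re-run with NCCL debug logging to see which transport NCCL selects and
# at which step the processes stall.
NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=1,2 \
    colossalai run --nproc_per_node 2 colossal.py

# If the log shows P2P being used over PCIe, disabling P2P (falling back
# to shared-memory/host-staged transfers) often unblocks GPU pairs that
# lack NVLink.
NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=1,2 \
    colossalai run --nproc_per_node 2 colossal.py
```

If disabling P2P resolves the hang, PCIe ACS (Access Control Services) settings in the BIOS/IOMMU configuration are a frequent underlying cause of broken PCIe peer-to-peer on this kind of topology.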