
[BUG]: RuntimeError: rank<size. 0 vs 0


🐛 Describe the bug

I'm running ./applications/ChatGPT/examples/train_dummy.py, but I changed the backend from "NCCL" to "GLOO" (in ColossalAI's initialize.py) and set a bunch of environment variables:

import os
os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"
os.environ["RANK"] = '0'
os.environ["LOCAL_RANK"] = '0'
os.environ["WORLD_SIZE"] = '0'
os.environ["MASTER_ADDR"] = '127.0.0.1'
os.environ["MASTER_PORT"] = '80'

I ran this script with --strategy ddp to see whether it could run on a 4-GPU machine, but the following error came up (the traceback below is from a run with --strategy colossalai_zero2):

C:\Users\abcd\Desktop\ColossalAI\applications\ChatGPT\examples>python train_dummy.py --strategy colossalai_zero2
2023-02-24 09:53:34.324233: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
C:\Users\abcd\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\library.py:130: UserWarning: Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::index.Tensor(Tensor self, Tensor?[] indices) -> Tensor
    registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\build\aten\src\ATen\RegisterSchema.cpp:6
  dispatch key: Meta
  previous kernel: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\functorch\BatchRulesScatterOps.cpp:1053
       new kernel: registered at /dev/null:219 (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\core\dispatch\OperatorEntry.cpp:156.)
  self.m.impl(name, dispatch_key, fn)
[W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [WIN-MEUAL5THNML]:80 (system error: 10049 - The requested address is not valid in its context.).
[W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [WIN-MEUAL5THNML]:80 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
  File "train_dummy.py", line 125, in <module>
    main(args)
  File "train_dummy.py", line 38, in main
    strategy = ColossalAIStrategy(stage=2, placement_policy='cuda')
  File "C:\Users\abcd\AppData\Local\Programs\Python\Python37\lib\site-packages\chatgpt\trainer\strategies\colossalai.py", line 77, in __init__
    super().__init__(seed)
  File "C:\Users\abcd\AppData\Local\Programs\Python\Python37\lib\site-packages\chatgpt\trainer\strategies\ddp.py", line 25, in __init__
    super().__init__()
  File "C:\Users\abcd\AppData\Local\Programs\Python\Python37\lib\site-packages\chatgpt\trainer\strategies\base.py", line 23, in __init__
    self.setup_distributed()
  File "C:\Users\abcd\AppData\Local\Programs\Python\Python37\lib\site-packages\chatgpt\trainer\strategies\colossalai.py", line 110, in setup_distributed
    colossalai.launch_from_torch({}, seed=self.seed)
  File "C:\Users\abcd\AppData\Local\Programs\Python\Python37\lib\site-packages\colossalai-0.2.5-py3.7.egg\colossalai\initialize.py", line 227, in launch_from_torch
    verbose=verbose)
  File "C:\Users\abcd\AppData\Local\Programs\Python\Python37\lib\site-packages\colossalai-0.2.5-py3.7.egg\colossalai\initialize.py", line 99, in launch
    gpc.init_global_dist(rank, world_size, backend, host, port)
  File "C:\Users\abcd\AppData\Local\Programs\Python\Python37\lib\site-packages\colossalai-0.2.5-py3.7.egg\colossalai\context\parallel_context.py", line 374, in init_global_dist
    dist.init_process_group(rank=rank, world_size=world_size, backend=backend, init_method=init_method)
  File "C:\Users\abcd\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\distributed\distributed_c10d.py", line 769, in init_process_group
    timeout=timeout,
  File "C:\Users\abcd\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\distributed\distributed_c10d.py", line 862, in _new_process_group_helper
    pg = ProcessGroupGloo(prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: [enforce fail at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\third_party\gloo\gloo\context.cc:27] rank < size. 0 vs 0

How can I fix this?

Environment

Windows, PyTorch 1.13.1+cu117, Python 3.7, gloo backend, multi-GPU, CUDA 11.6

nameless0704 avatar Feb 24 '23 03:02 nameless0704

I also want to know what master address/port I should use to avoid system error 10049.

nameless0704 avatar Feb 24 '23 07:02 nameless0704

Hi, these environment variables should be set carefully. We generally recommend that users launch with torchrun.
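
For example, on the 4-GPU machine from the report a launch could look roughly like this; torchrun then sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT for every process itself, so none of them need to be hard-coded in the script (the script arguments are taken from the original report):

torchrun --standalone --nproc_per_node=4 train_dummy.py --strategy colossalai_zero2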

ver217 avatar Feb 28 '23 07:02 ver217

Hi @nameless0704, please check the installation instructions; we do not currently support Windows: https://github.com/hpcaitech/ColossalAI#Installation. The Chat part has also been updated a lot: https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat. This issue was closed due to inactivity. Thanks.

binmakeswell avatar Apr 20 '23 09:04 binmakeswell