[BUG]: RuntimeError: rank<size. 0 vs 0
🐛 Describe the bug
I'm running ./applications/ChatGPT/examples/train_dummy.py, but I changed the backend from "NCCL" to "GLOO" (in ColossalAI's initialize.py) and set a number of environment variables:
import os
os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"
os.environ["RANK"] = '0'
os.environ["LOCAL_RANK"] = '0'
os.environ["WORLD_SIZE"] = '0'
os.environ["MASTER_ADDR"] = '127.0.0.1'
os.environ["MASTER_PORT"] = '80'
I ran this script with --strategy ddp to see if it could run on a 4-GPU machine, but the following error came up.
C:\Users\abcd\Desktop\ColossalAI\applications\ChatGPT\examples>python train_dummy.py --strategy colossalai_zero2
2023-02-24 09:53:34.324233: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
C:\Users\abcd\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\library.py:130: UserWarning: Overriding a previously registered kernel for the same operator and the same dispatch key
operator: aten::index.Tensor(Tensor self, Tensor?[] indices) -> Tensor
registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\build\aten\src\ATen\RegisterSchema.cpp:6
dispatch key: Meta
previous kernel: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\functorch\BatchRulesScatterOps.cpp:1053
new kernel: registered at /dev/null:219 (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\core\dispatch\OperatorEntry.cpp:156.)
self.m.impl(name, dispatch_key, fn)
[W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [WIN-MEUAL5THNML]:80 (system error: 10049 - The requested address is not valid in its context.).
[W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [WIN-MEUAL5THNML]:80 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
File "train_dummy.py", line 125, in <module>
main(args)
File "train_dummy.py", line 38, in main
strategy = ColossalAIStrategy(stage=2, placement_policy='cuda')
File "C:\Users\abcd\AppData\Local\Programs\Python\Python37\lib\site-packages\chatgpt\trainer\strategies\colossalai.py", line 77, in __init__
super().__init__(seed)
File "C:\Users\abcd\AppData\Local\Programs\Python\Python37\lib\site-packages\chatgpt\trainer\strategies\ddp.py", line 25, in __init__
super().__init__()
File "C:\Users\abcd\AppData\Local\Programs\Python\Python37\lib\site-packages\chatgpt\trainer\strategies\base.py", line 23, in __init__
self.setup_distributed()
File "C:\Users\abcd\AppData\Local\Programs\Python\Python37\lib\site-packages\chatgpt\trainer\strategies\colossalai.py", line 110, in setup_distributed
colossalai.launch_from_torch({}, seed=self.seed)
File "C:\Users\abcd\AppData\Local\Programs\Python\Python37\lib\site-packages\colossalai-0.2.5-py3.7.egg\colossalai\initialize.py", line 227, in launch_from_torch
verbose=verbose)
File "C:\Users\abcd\AppData\Local\Programs\Python\Python37\lib\site-packages\colossalai-0.2.5-py3.7.egg\colossalai\initialize.py", line 99, in launch
gpc.init_global_dist(rank, world_size, backend, host, port)
File "C:\Users\abcd\AppData\Local\Programs\Python\Python37\lib\site-packages\colossalai-0.2.5-py3.7.egg\colossalai\context\parallel_context.py", line 374, in init_global_dist
dist.init_process_group(rank=rank, world_size=world_size, backend=backend, init_method=init_method)
File "C:\Users\abcd\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\distributed\distributed_c10d.py", line 769, in init_process_group
timeout=timeout,
File "C:\Users\abcd\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\distributed\distributed_c10d.py", line 862, in _new_process_group_helper
pg = ProcessGroupGloo(prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: [enforce fail at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\third_party\gloo\gloo\context.cc:27] rank < size. 0 vs 0
How can I fix this?
Environment
Windows, PyTorch 1.13.1+cu117, Python 3.7, gloo backend, multi-GPU, CUDA 11.6
I also want to know what master address/port I should use to avoid system error 10049.
Hi, these environment variables need to be set carefully. We generally recommend that users launch with torchrun.
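For example (a sketch, assuming a single node with 4 GPUs and the train_dummy.py invocation from above; adjust the flags for your setup), torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT for each process automatically, so none of the manual os.environ assignments are needed:
torchrun --standalone --nproc_per_node=4 train_dummy.py --strategy colossalai_zero2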
Hi @nameless0704, please check the installation instructions; we do not currently support Windows: https://github.com/hpcaitech/ColossalAI#Installation The Chat application has also been updated significantly: https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat This issue was closed due to inactivity. Thanks.