xtuner

Multi-node multi-GPU training fails with: ss1.ss_family == ss2.ss_family. 2 vs 10

Open sph116 opened this issue 1 year ago • 1 comment

Launch command on rank0:
NPROC_PER_NODE=1 NNODES=2 PORT=29600 ADDR=172.18.12.59 NODE_RANK=0 xtuner train train_config/internlm2_5_chat_7b_rank0_server_lora_train.py --deepspeed deepspeed_zero2

Launch command on rank1:
NPROC_PER_NODE=1 NNODES=2 PORT=29600 ADDR=172.18.12.59 NODE_RANK=1 xtuner train train_config/internlm2_5_chat_7b_rank1_server_lora_train.py --deepspeed deepspeed_zero2

rank1 and rank0 can communicate with each other, and single-GPU training succeeds on both machines.

Error log:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/xtuner/tools/train.py", line 360, in <module>
[rank0]:     main()
[rank0]:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/xtuner/tools/train.py", line 356, in main
[rank0]:     runner.train()
[rank0]:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
[rank0]:     self._train_loop = self.build_train_loop(
[rank0]:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
[rank0]:     loop = LOOPS.build(
[rank0]:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
[rank0]:     return self.build_func(cfg, *args, **kwargs, registry=self)
[rank0]:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
[rank0]:     obj = obj_cls(**args)  # type: ignore
[rank0]:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in __init__
[rank0]:     dataloader = runner.build_dataloader(
[rank0]:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
[rank0]:     dataset = DATASETS.build(dataset_cfg)
[rank0]:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
[rank0]:     return self.build_func(cfg, *args, **kwargs, registry=self)
[rank0]:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
[rank0]:     obj = obj_cls(**args)  # type: ignore
[rank0]:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 305, in process_hf_dataset
[rank0]:     group_gloo = dist.new_group(backend='gloo', timeout=xtuner_dataset_timeout)
[rank0]:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
[rank0]:     func_return = func(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4125, in new_group
[rank0]:     return _new_group_with_tag(
[rank0]:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4205, in _new_group_with_tag
[rank0]:     pg, pg_store = _new_process_group_helper(
[rank0]:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1569, in _new_process_group_helper
[rank0]:     backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
[rank0]: RuntimeError: [enforce fail at ../third_party/gloo/gloo/transport/tcp/device.cc:276] ss1.ss_family == ss2.ss_family. 2 vs 10
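
For reference, the numbers in `2 vs 10` look like socket address family constants: on Linux, 2 is `AF_INET` (IPv4) and 10 is `AF_INET6` (IPv6), so the check appears to be comparing an IPv4 address with an IPv6 one. The sketch below is only a diagnostic aid, not a confirmed fix: it decodes the two numbers and lists the address families a host name resolves to. The helper names `family_name` and `resolved_families` are made up for this example, and checking the machine's own host name is an assumption about where the mismatch might come from.

```python
import socket


def family_name(number: int) -> str:
    """Map a numeric address family to its symbolic name on this platform."""
    try:
        return socket.AddressFamily(number).name
    except ValueError:
        # The number is not a known address family on this platform.
        return f"unknown({number})"


def resolved_families(host: str, port: int = 29600) -> set[str]:
    """Return the address families that `host` resolves to for TCP."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr).
    return {family_name(info[0]) for info in infos}


if __name__ == "__main__":
    # Decode the two values from the error message.
    print(family_name(2), family_name(10))

    # Assumption: the local host name is worth checking on each node.
    host = socket.gethostname()
    try:
        print(host, resolved_families(host))
    except socket.gaierror as exc:
        print(f"could not resolve {host}: {exc}")
```

Running this on both nodes and comparing the output may show whether one machine resolves to IPv6 while the other resolves to IPv4. If so, PyTorch documents the `GLOO_SOCKET_IFNAME` environment variable for choosing the network interface that Gloo uses; whether setting it resolves this particular failure is not confirmed in this thread.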

sph116, Sep 06 '24 13:09

I'm running into the same issue. Is there any update on this?

heixue509, Jan 10 '25 03:01