Launch command for rank 0:
NPROC_PER_NODE=1 NNODES=2 PORT=29600 ADDR=172.18.12.59 NODE_RANK=0 xtuner train train_config/internlm2_5_chat_7b_rank0_server_lora_train.py --deepspeed deepspeed_zero2
Launch command for rank 1:
NPROC_PER_NODE=1 NNODES=2 PORT=29600 ADDR=172.18.12.59 NODE_RANK=1 xtuner train train_config/internlm2_5_chat_7b_rank1_server_lora_train.py --deepspeed deepspeed_zero2
rank 1 and rank 0 can communicate with each other, and training succeeds on both machines in single-GPU mode.
Error log:
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/xtuner/tools/train.py", line 360, in <module>
[rank0]: main()
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/xtuner/tools/train.py", line 356, in main
[rank0]: runner.train()
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
[rank0]: self._train_loop = self.build_train_loop(
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
[rank0]: loop = LOOPS.build(
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
[rank0]: return self.build_func(cfg, *args, **kwargs, registry=self)
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
[rank0]: obj = obj_cls(**args) # type: ignore
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in __init__
[rank0]: dataloader = runner.build_dataloader(
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
[rank0]: dataset = DATASETS.build(dataset_cfg)
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
[rank0]: return self.build_func(cfg, *args, **kwargs, registry=self)
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
[rank0]: obj = obj_cls(**args) # type: ignore
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 305, in process_hf_dataset
[rank0]: group_gloo = dist.new_group(backend='gloo', timeout=xtuner_dataset_timeout)
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
[rank0]: func_return = func(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4125, in new_group
[rank0]: return _new_group_with_tag(
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4205, in _new_group_with_tag
[rank0]: pg, pg_store = _new_process_group_helper(
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1569, in _new_process_group_helper
[rank0]: backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
[rank0]: RuntimeError: [enforce fail at ../third_party/gloo/gloo/transport/tcp/device.cc:276] ss1.ss_family == ss2.ss_family. 2 vs 10
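The "2 vs 10" in the final enforce failure looks like a pair of socket address-family constants: on Linux, 2 is AF_INET (IPv4) and 10 is AF_INET6 (IPv6). That would suggest the two ends of the Gloo connection picked addresses from different families. The sketch below is only a diagnostic aid, not part of xtuner or PyTorch. It assumes both nodes run Linux, and the helper names `explain_mismatch` and `resolved_families` are made up for illustration.

```python
import socket

# Linux values of the sockaddr family constants.
# Assumption: both nodes run Linux (AF_INET6 has other values elsewhere).
LINUX_FAMILIES = {2: "AF_INET (IPv4)", 10: "AF_INET6 (IPv6)"}


def explain_mismatch(left: int, right: int) -> str:
    """Translate the 'N vs M' pair from the Gloo enforce failure into names."""
    name_l = LINUX_FAMILIES.get(left, f"unknown family {left}")
    name_r = LINUX_FAMILIES.get(right, f"unknown family {right}")
    return f"{name_l} vs {name_r}"


def resolved_families(host: str) -> set[str]:
    """Return the names of the address families `host` resolves to locally."""
    infos = socket.getaddrinfo(host, None, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr).
    return {info[0].name for info in infos}


if __name__ == "__main__":
    print(explain_mismatch(2, 10))  # AF_INET (IPv4) vs AF_INET6 (IPv6)
    hostname = socket.gethostname()
    try:
        print(hostname, resolved_families(hostname))
    except socket.gaierror as exc:
        print(f"could not resolve {hostname}: {exc}")
```

Running this on both machines shows which families each hostname resolves to. If one node reports only AF_INET and the other reports AF_INET6, that would be consistent with the error above. A commonly suggested thing to try in that case is setting the `GLOO_SOCKET_IFNAME` environment variable on both nodes to the interface that carries the 172.18.12.x network, though whether that resolves this particular failure is not confirmed here.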
I'm hitting the same issue. Is there any update on this?