RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
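Training on a single RTX 3090 fails with the error below. My training command is python train.py -c configs/dfine/dfine_hgnetv2_l_coco.yml. Is there a config option to turn off distributed training? In other words, can D-FINE be trained on a single GPU?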
Traceback (most recent call last):
  File "/workspace/D-FINE/src/nn/backbone/hgnetv2.py", line 498, in __init__
    if torch.distributed.get_rank() == 0:
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1173, in get_rank
    default_pg = _get_default_group()
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 707, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/workspace/D-FINE/train.py", line 84, in <module>
    main(args)
  File "/workspace/D-FINE/train.py", line 54, in main
    solver.fit()
  File "/workspace/D-FINE/src/solver/det_solver.py", line 24, in fit
    self.train()
  File "/workspace/D-FINE/src/solver/_solver.py", line 81, in train
    self._setup()
  File "/workspace/D-FINE/src/solver/_solver.py", line 47, in _setup
    self.model = cfg.model
  File "/workspace/D-FINE/src/core/yaml_config.py", line 38, in model
    self._model = create(self.yaml_cfg['model'], self.global_cfg)
  File "/workspace/D-FINE/src/core/workspace.py", line 146, in create
    module_kwargs[k] = create(_cfg['_name'], global_cfg)
  File "/workspace/D-FINE/src/core/workspace.py", line 180, in create
    return module(**module_kwargs)
  File "/workspace/D-FINE/src/nn/backbone/hgnetv2.py", line 512, in __init__
    if torch.distributed.get_rank() == 0:
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1173, in get_rank
    default_pg = _get_default_group()
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 707, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Have you solved this problem? I also encountered the same problem
Launching through torchrun with a single process works for me:
CUDA_VISIBLE_DEVICES=0 torchrun --master_port=7777 --nproc_per_node=1 train.py -c configs/dfine/dfine_hgnetv2_l_coco.yml --use-amp --seed=0
What about Windows? The CUDA_VISIBLE_DEVICES=0 torchrun --master_port=7777 --nproc_per_node=1 command does not work on Windows.
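One likely reason the command fails on Windows is that the `VAR=value command` prefix is POSIX shell syntax, which cmd.exe and PowerShell do not understand. A minimal sketch of a portable launcher follows. It sets the environment variable from Python and calls `python -m torch.distributed.run`, the module that the `torchrun` script wraps. The names `build_launch` and `launch` are hypothetical helpers, not part of D-FINE, and it is an unverified assumption that the distributed backend D-FINE selects is available on Windows.

```python
import os
import subprocess
import sys


def build_launch(config, gpu="0", port=7777):
    """Build the command and environment for a one-process torchrun launch.

    Hypothetical helper: mirrors the torchrun command posted in this thread.
    """
    env = dict(os.environ)
    # Equivalent of the shell prefix CUDA_VISIBLE_DEVICES=0
    env["CUDA_VISIBLE_DEVICES"] = gpu
    cmd = [
        sys.executable, "-m", "torch.distributed.run",  # same entry point as `torchrun`
        f"--master_port={port}",
        "--nproc_per_node=1",
        "train.py", "-c", config,
        "--use-amp", "--seed=0",
    ]
    return cmd, env


def launch(config):
    """Run the training script and return its exit code."""
    cmd, env = build_launch(config)
    return subprocess.run(cmd, env=env).returncode
```

To try it, call `launch("configs/dfine/dfine_hgnetv2_l_coco.yml")` from the D-FINE repository root. In cmd.exe the same effect can be had by running `set CUDA_VISIBLE_DEVICES=0` on its own line before the `torchrun` command.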
Is there any way to solve this?
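The traceback shows that `torch.distributed.get_rank()` is called in `hgnetv2.py` (lines 498 and 512) before any process group exists, which is what happens when `train.py` is started with plain `python`. One possible local workaround, sketched here and not the maintainers' fix, is to guard that call so it falls back to rank 0 in a non-distributed run. `safe_get_rank` is a hypothetical helper name, and it is an assumption that no other part of the repository makes a similar unguarded call.

```python
def safe_get_rank() -> int:
    """Return the distributed rank, or 0 when no process group is initialized.

    Hypothetical helper: intended as a drop-in for the bare
    torch.distributed.get_rank() call shown in the traceback.
    """
    try:
        import torch.distributed as dist
    except ImportError:
        # torch is not installed; treat the process as the only rank.
        return 0
    # get_rank() raises unless init_process_group() has been called,
    # so check both availability and initialization first.
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    return 0
```

With this helper, the condition `if torch.distributed.get_rank() == 0:` would become `if safe_get_rank() == 0:`, so a single-process run behaves like rank 0.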