D-FINE

Error during single-GPU training on Windows 10

Open QIANXUNZDL123 opened this issue 1 year ago • 9 comments

Hello, I am training on Windows with only one GPU, and training fails with the following error:

Traceback (most recent call last):
  File "E:\D-FINE\src\nn\backbone\hgnetv2.py", line 498, in __init__
    if torch.distributed.get_rank() == 0:
  File "D:\Anaconda2019\envs\qat\lib\site-packages\torch\distributed\distributed_c10d.py", line 1469, in get_rank
    default_pg = _get_default_group()
  File "D:\Anaconda2019\envs\qat\lib\site-packages\torch\distributed\distributed_c10d.py", line 940, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 84, in <module>
    main(args)
  File "train.py", line 54, in main
    solver.fit()
  File "E:\D-FINE\src\solver\det_solver.py", line 24, in fit
    self.train()
  File "E:\D-FINE\src\solver\_solver.py", line 81, in train
    self._setup()
  File "E:\D-FINE\src\solver\_solver.py", line 47, in _setup
    self.model = cfg.model
  File "E:\D-FINE\src\core\yaml_config.py", line 38, in model
    self._model = create(self.yaml_cfg['model'], self.global_cfg)
  File "E:\D-FINE\src\core\workspace.py", line 146, in create
    module_kwargs[k] = create(_cfg['_name'], global_cfg)
  File "E:\D-FINE\src\core\workspace.py", line 180, in create
    return module(**module_kwargs)
  File "E:\D-FINE\src\nn\backbone\hgnetv2.py", line 512, in __init__
    if torch.distributed.get_rank() == 0:
  File "D:\Anaconda2019\envs\qat\lib\site-packages\torch\distributed\distributed_c10d.py", line 1469, in get_rank
    default_pg = _get_default_group()
  File "D:\Anaconda2019\envs\qat\lib\site-packages\torch\distributed\distributed_c10d.py", line 940, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Training command: python train.py -c configs/dfine/dfine_hgnetv2_n_coco.yml

What should I change to fix this?

QIANXUNZDL123 (Dec 13 '24 07:12)
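
Editorial note: both tracebacks above end in `torch.distributed.get_rank()`, called from `hgnetv2.py` when the script is launched with plain `python train.py`, so no process group has been initialized. One possible workaround is to guard the rank lookup. The sketch below is a minimal illustration, not the repository's code: the helper name `get_rank_or_zero` is hypothetical, and only `torch.distributed.is_available()`, `is_initialized()`, and `get_rank()` are real PyTorch calls.

```python
def get_rank_or_zero() -> int:
    """Return this process's distributed rank, or 0 for a plain single-process run.

    Hypothetical helper (not part of D-FINE): it avoids the RuntimeError that
    torch.distributed.get_rank() raises when no process group exists.
    """
    try:
        import torch.distributed as dist
    except ImportError:
        # PyTorch is not installed; treat the run as a single process.
        return 0
    # is_available() is False on builds without distributed support;
    # is_initialized() is False until init_process_group() has been called.
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    return 0


if __name__ == "__main__":
    # Launched with plain `python`, no process group exists, so this prints 0.
    print(get_rank_or_zero())
```

Under this assumption, the checks at the `hgnetv2.py` lines named in the traceback would compare `get_rank_or_zero() == 0` instead of calling `torch.distributed.get_rank()` directly. Any other `torch.distributed` calls on the same code path would presumably need a similar guard; the traceback only shows `get_rank`, and this has not been verified against the repository.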

I ran into the same problem. Have you solved it? Could it be that only distributed training is supported?

lycode1202 (Dec 25 '24 04:12)

I ran into the same problem too. Have you solved it?

nhfujfgid (Jan 03 '25 14:01)

I ran into the same problem as well. Could someone who knows please help?

super-song-sir (Jan 04 '25 14:01)

> I ran into the same problem as well. Could someone who knows please help?

I haven't solved it either. All I can do is fine-tune from the pretrained model.

QIANXUNZDL123 (Jan 06 '25 02:01)

Has this been solved yet?

ss989462 (Feb 21 '25 06:02)

No, I gave up. I'm busy writing my thesis.

(Sent as an email reply on Friday, February 21, 2025, 2:01 PM. Subject: Re: [Peterande/D-FINE] Error during single-GPU training on Windows 10 (Issue #115))

nhfujfgid (Feb 21 '25 06:02)

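Would switching to a different PyTorch version fix it? It doesn't look like a problem in the code itself; more likely it comes from the platform or from torch.

luguoyixiazi (Jun 24 '25 16:06)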

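>> I ran into the same problem as well. Could someone who knows please help?

> I haven't solved it either. All I can do is fine-tune from the pretrained model.

How do you do that?

ggyda (Jun 25 '25 12:06)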

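>>> I ran into the same problem as well. Could someone who knows please help?

>> I haven't solved it either. All I can do is fine-tune from the pretrained model.

> How do you do that?

Read the documentation. Isn't it covered there?

QIANXUNZDL123 (Jul 04 '25 08:07)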