
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Fragilesky opened this issue 1 year ago • 4 comments

When I run "python3 train.py --cfg_file cfgs/kitti_models/VFF_PVRCNN.yaml", I get the following error:

Traceback (most recent call last):
  File "train.py", line 198, in <module>
    main()
  File "train.py", line 170, in main
    merge_all_iters_to_one_epoch=args.merge_all_iters_to_one_epoch
  File "/home/hu/VFF/tools/train_utils/train_utils.py", line 93, in train_model
    dataloader_iter=dataloader_iter
  File "/home/hu/VFF/tools/train_utils/train_utils.py", line 38, in train_one_epoch
    loss, tb_dict, disp_dict = model_func(model, batch)
  File "/home/hu/VFF/pcdet/models/__init__.py", line 44, in model_func
    ret_dict, tb_dict, disp_dict = model(batch_dict)
  File "/home/hu/anaconda3/envs/pcdet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hu/VFF/pcdet/models/detectors/pv_rcnn_fusion.py", line 11, in forward
    batch_dict = cur_module(batch_dict)
  File "/home/hu/anaconda3/envs/pcdet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hu/VFF/pcdet/models/backbones_3d/vfe/image_point_vfe.py", line 100, in forward
    batch_dict = self.ffn(batch_dict)
  File "/home/hu/anaconda3/envs/pcdet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hu/VFF/pcdet/models/backbones_3d/vfe/image_vfe_modules/ffn/pyramid_ffn.py", line 57, in forward
    ifn_result = self.ifn(images)
  File "/home/hu/anaconda3/envs/pcdet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hu/VFF/pcdet/models/backbones_3d/vfe/image_vfe_modules/ffn/ifn/seg_template.py", line 123, in forward
    features = self.model.backbone(x)
  File "/home/hu/anaconda3/envs/pcdet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hu/anaconda3/envs/pcdet/lib/python3.7/site-packages/torchvision/models/_utils.py", line 63, in forward
    x = module(x)
  File "/home/hu/anaconda3/envs/pcdet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hu/anaconda3/envs/pcdet/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 532, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "/home/hu/anaconda3/envs/pcdet/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 711, in get_world_size
    return _get_group_size(group)
  File "/home/hu/anaconda3/envs/pcdet/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 263, in _get_group_size
    default_pg = _get_default_group()
  File "/home/hu/anaconda3/envs/pcdet/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 347, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

It seems the code needs to be changed from multi-GPU to single-GPU training, e.g. by setting "SyncBN" to "BN", but I don't know where that is configured or whether there is some other solution.
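For reference, the error message itself hints at one possible workaround: initializing a one-process default group before training starts, so that SyncBatchNorm sees a world size of 1 and falls back to ordinary batch-norm statistics. A minimal sketch (the backend, address and port below are assumptions, not values taken from the VFF code):

```python
import os
import torch.distributed as dist

# Minimal sketch: create a one-process default group so that
# SyncBatchNorm.forward (which calls torch.distributed.get_world_size)
# no longer raises on a single GPU. Address/port/backend are placeholders.
os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '29500')
if not dist.is_initialized():
    # use backend='gloo' if NCCL is unavailable on the machine
    dist.init_process_group(backend='nccl', rank=0, world_size=1)
```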

Fragilesky avatar Aug 31 '22 06:08 Fragilesky

Same problem here. Maybe you can simply solve this problem with GPU_NUM>1.

xxxxhh avatar Sep 01 '22 08:09 xxxxhh

Hi, I guess SyncBN needs to be changed to BN in this situation.
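For illustration, one way to do that conversion on an already-built model is to walk the module tree and swap every nn.SyncBatchNorm for an nn.BatchNorm2d while copying its parameters and running statistics. This helper is not part of VFF; it is only a sketch for single-GPU debugging:

```python
import torch.nn as nn

def revert_sync_batchnorm(module):
    """Recursively replace nn.SyncBatchNorm with nn.BatchNorm2d (illustrative sketch)."""
    module_output = module
    if isinstance(module, nn.SyncBatchNorm):
        module_output = nn.BatchNorm2d(
            module.num_features,
            eps=module.eps,
            momentum=module.momentum,
            affine=module.affine,
            track_running_stats=module.track_running_stats,
        )
        if module.affine:
            # Keep the learned scale/shift so pretrained weights behave the same.
            module_output.weight.data = module.weight.data.clone()
            module_output.bias.data = module.bias.data.clone()
        if module.track_running_stats:
            module_output.running_mean = module.running_mean
            module_output.running_var = module.running_var
            module_output.num_batches_tracked = module.num_batches_tracked
    for name, child in module.named_children():
        module_output.add_module(name, revert_sync_batchnorm(child))
    return module_output
```

Usage would be something like model = revert_sync_batchnorm(model) right after the network is built.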

yanwei-li avatar Sep 04 '22 05:09 yanwei-li

@xxxxhh @yanwei-li Thanks for your replies! For now I can only use a single GPU, and I have not found the configuration file that sets SyncBN to BN, so I'm still looking for a solution...

Fragilesky avatar Sep 06 '22 02:09 Fragilesky

Maybe you can try to change the syncBN in train.py manually. : )
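As a concrete but hypothetical example of such a manual change, assuming an OpenPCDet-style train.py where the model is created by build_network and a dist_train flag marks distributed launches (neither name is verified against the VFF code), the SyncBN layers could be reverted right after the model is built using a helper like the revert_sync_batchnorm sketch above:

```python
# Hypothetical edit inside tools/train.py; the surrounding code in VFF may differ.
model = build_network(model_cfg=cfg.MODEL, num_class=len(cfg.CLASS_NAMES), dataset=train_set)
if not dist_train:
    # Single-GPU run: fall back to plain BatchNorm so no process group is needed.
    model = revert_sync_batchnorm(model)
```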

xxxxhh avatar Sep 06 '22 03:09 xxxxhh