SparseR-CNN

Error during multi-GPU training

Open 1061136002 opened this issue 4 years ago • 5 comments

Hello! I started training with this command: python projects/SparseRCNN/train_net.py --config-file projects/SparseRCNN/configs/sparsercnn.res50.100pro.3x.yaml --num-gpus 4 --gpu "0, 1, 2,3". This is single-machine training on 4 GPUs, but it fails with the following error: RuntimeError: [enforce fail at /opt/conda/conda-bld/pytorch_1595629416375/work/third_party/gloo/gloo/transport/tcp/device.cc:208] ifa != nullptr. Unable to find interface for: [0.0.8.34] How can I fix this?

1061136002 avatar Dec 09 '20 14:12 1061136002

Set the gloo option, mate. I ran into this too; it is a network interface problem in the communication backend.

wutheringcoo avatar Dec 10 '20 10:12 wutheringcoo

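> Set the gloo option, mate. I ran into this too; it is a network interface problem in the communication backend.

Where do I set that? I don't understand this stuff at all.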

1061136002 avatar Dec 10 '20 10:12 1061136002

You can add os.environ['GLOO_SOCKET_IFNAME'] = 'eno1' in train_net.py, like this:

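import os  # needed for os.environ; place it with the other imports at the top of train_net.py

if __name__ == "__main__":
    args = default_argument_parser().parse_args()
    print("Command Line Args:", args)
    # Tell gloo which network interface to use. 'eno1' is the interface name
    # on my machine; replace it with the name shown by `ifconfig` or `ip addr` on yours.
    os.environ['GLOO_SOCKET_IFNAME'] = 'eno1'
    launch(
        main,
        args.num_gpus,
        num_machines=args.num_machines,
        machine_rank=args.machine_rank,
        dist_url=args.dist_url,
        args=(args,),
    )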
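
If you are not sure what your interface is called, here is a minimal sketch for picking one from Python instead of hard-coding 'eno1'. It assumes a system where socket.if_nameindex() is available (Linux works), and the helper names list_interfaces and choose_interface are hypothetical, not part of SparseR-CNN or detectron2. It only guesses a non-loopback interface, so check the result against `ip addr` before relying on it.

```python
import os
import socket


def list_interfaces():
    """Return the names of all network interfaces known to the OS."""
    return [name for _index, name in socket.if_nameindex()]


def choose_interface(names, preferred=("eno1", "eth0")):
    """Pick an interface name for gloo (hypothetical helper).

    Prefers the names in `preferred` (an assumption, adjust for your machine),
    otherwise falls back to the first non-loopback interface.
    Returns None if only loopback interfaces exist.
    """
    candidates = [n for n in names if n != "lo"]
    for name in preferred:
        if name in candidates:
            return name
    return candidates[0] if candidates else None


def set_gloo_interface(names):
    """Set GLOO_SOCKET_IFNAME unless the user already exported it."""
    chosen = choose_interface(names)
    if chosen is not None:
        os.environ.setdefault("GLOO_SOCKET_IFNAME", chosen)
    return os.environ.get("GLOO_SOCKET_IFNAME")
```

To use it, call set_gloo_interface(list_interfaces()) in train_net.py before launch(...). Because it uses setdefault, a value you export in the shell beforehand still takes priority.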

lujzz avatar May 29 '21 10:05 lujzz

May I ask how you configured training on your own dataset?

hellojiabin avatar Oct 11 '21 10:10 hellojiabin

ok

Kunlei-Hong avatar Nov 30 '21 10:11 Kunlei-Hong