Synchronized-BatchNorm-PyTorch

Training cannot start

Open shuuchen opened this issue 4 years ago • 7 comments

Hi,

Good job! I tried to use it as follows:

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = Model(...)
model = nn.DataParallel(model, device_ids=[0, 1])
model = convert_model(model).to(device)

However, it got stuck and training couldn't start. Have you seen similar problems before?

shuuchen avatar Oct 08 '20 09:10 shuuchen

@shuuchen I met the same problem. Have you solved it?

AManHasNoName12138 avatar Oct 16 '20 12:10 AManHasNoName12138

Up?

rvandeghen avatar Nov 18 '20 10:11 rvandeghen

I wrote a test script that solves the hanging problem. I guess there is something wrong with the reduce function.

Please refer to the script: https://gist.github.com/shuuchen/7463009370e9ddf77e649f3fec259024

You can easily adapt the code to your own task.

shuuchen avatar Nov 18 '20 13:11 shuuchen

@shuuchen I guess you used the PyTorch tutorial on DDP. I've dug into it and will most likely use the SyncBatchNorm implemented by PyTorch as well. Anyway, thank you, as it may help others :)

rvandeghen avatar Nov 18 '20 16:11 rvandeghen
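(For reference, a minimal sketch of the alternative mentioned above: PyTorch's built-in `torch.nn.SyncBatchNorm.convert_sync_batchnorm`, which replaces plain BatchNorm layers in place. The model and layer sizes here are made up for illustration; note that the built-in SyncBatchNorm only synchronizes statistics under DistributedDataParallel, one process per GPU, not under `nn.DataParallel`.)

```python
import torch
import torch.nn as nn

# A toy model with an ordinary BatchNorm layer (sizes are arbitrary).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
)

# Replace every BatchNorm*d module with a SyncBatchNorm equivalent.
sync_model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
print(type(sync_model[1]).__name__)  # SyncBatchNorm

# In an actual DDP training script (after init_process_group), you would
# then wrap the converted model, e.g.:
#   ddp_model = nn.parallel.DistributedDataParallel(
#       sync_model.cuda(rank), device_ids=[rank])
```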

> I wrote a test script that solves the hanging problem. I guess there is something wrong with the reduce function.
>
> Please refer to the script: https://gist.github.com/shuuchen/7463009370e9ddf77e649f3fec259024
>
> You can easily adapt the code to your own task.

Have you found out the exact cause of the problem in this SyncBatchNorm implementation?

zwyking avatar Apr 08 '21 13:04 zwyking

> @shuuchen I met the same problem. Have you solved it?

Hi, have you solved this problem?

zwyking avatar Apr 08 '21 13:04 zwyking

See also https://github.com/vacancy/Synchronized-BatchNorm-PyTorch/issues/44#issuecomment-815135207

vacancy avatar Apr 08 '21 19:04 vacancy