Synchronized-BatchNorm-PyTorch

Training cannot start

Open shuuchen opened this issue 4 years ago • 7 comments

Hi,

Good job! I tried to use it as follows:

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = Model(...)
model = nn.DataParallel(model, device_ids=[0, 1])
model = convert_model(model).to(device)

However, it got stuck and training couldn't start. Have you seen similar problems before?

shuuchen avatar Oct 08 '20 09:10 shuuchen

@shuuchen I met the same problem. Have you solved it?

AManHasNoName12138 avatar Oct 16 '20 12:10 AManHasNoName12138

Up?

rvandeghen avatar Nov 18 '20 10:11 rvandeghen

I wrote a test script that solves the hanging problem. I guess there is something wrong with the reduce function.

Please refer to the script: https://gist.github.com/shuuchen/7463009370e9ddf77e649f3fec259024

You can easily adapt the code to your own task.

shuuchen avatar Nov 18 '20 13:11 shuuchen

@shuuchen I guess you used the PyTorch tutorial on DDP. I've dug into it and will most likely use the SyncBatchNorm implemented by PyTorch as well. Anyway, thank you, as it may help others :)

rvandeghen avatar Nov 18 '20 16:11 rvandeghen
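(For reference, a minimal sketch of the alternative mentioned above: PyTorch's built-in `torch.nn.SyncBatchNorm.convert_sync_batchnorm`, which replaces plain BatchNorm layers in place. The model and layer sizes here are made up for illustration; note that the built-in SyncBatchNorm only synchronizes statistics under DistributedDataParallel, one process per GPU, not under `nn.DataParallel`.)

```python
import torch
import torch.nn as nn

# A toy model with an ordinary BatchNorm layer (sizes are arbitrary).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
)

# Replace every BatchNorm*d module with a SyncBatchNorm equivalent.
sync_model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
print(type(sync_model[1]).__name__)  # SyncBatchNorm

# In an actual DDP training script (after init_process_group), you would
# then wrap the converted model, e.g.:
#   ddp_model = nn.parallel.DistributedDataParallel(
#       sync_model.cuda(rank), device_ids=[rank])
```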

> I wrote a test script that solves the hanging problem. I guess there is something wrong with the reduce function.
>
> Please refer to the script: https://gist.github.com/shuuchen/7463009370e9ddf77e649f3fec259024
>
> You can easily adapt the code to your own task.

Have you found out the exact cause of the problem in this SyncBatchNorm implementation?

zwyking avatar Apr 08 '21 13:04 zwyking

> @shuuchen I met the same problem. Have you solved it?

Hi, have you solved this problem?

zwyking avatar Apr 08 '21 13:04 zwyking

See also https://github.com/vacancy/Synchronized-BatchNorm-PyTorch/issues/44#issuecomment-815135207

vacancy avatar Apr 08 '21 19:04 vacancy