BPNet

About your DDP code

Open chaolongy opened this issue 3 years ago • 0 comments

Thank you very much for your excellent work. While reading your training and test code, I ran into the following points of confusion. In 'train.py', the validation sampler is val_sampler = torch.utils.data.distributed.DistributedSampler(val_data), and the metrics are aggregated with dist.all_reduce(intersection), dist.all_reduce(union), and dist.all_reduce(target). In 'test.py', however, val_sampler = None, and only dist.all_reduce(output_3d) is used.
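For context on the sampler difference, here is a simplified, illustrative sketch (not the library's actual implementation) of what DistributedSampler does by default: it pads the index list so every rank receives the same number of samples, then strides across it. The padding duplicates a few samples, which matters for validation metrics:

```python
import math

def distributed_sampler_indices(n, world_size, rank):
    # Simplified sketch of torch.utils.data.distributed.DistributedSampler's
    # default (drop_last=False, no shuffle) behaviour: pad the index list so
    # it divides evenly across ranks, then take every world_size-th index.
    indices = list(range(n))
    total = math.ceil(n / world_size) * world_size
    indices += indices[: total - n]  # padding repeats samples from the front
    return indices[rank:total:world_size]

# With 10 validation samples and 4 ranks, samples 0 and 1 are evaluated twice:
for rank in range(4):
    print(rank, distributed_sampler_indices(10, 4, rank))
```

With val_sampler = None, by contrast, every rank iterates over the full, unpadded validation set, so no sample is duplicated.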

My question:

  1. Why are the validation samplers inconsistent between the two files?
  2. I found that performance did not change when val_sampler=None. What is the purpose of the dist.all_reduce() calls here?
  3. I found that when val_sampler=torch.utils.data.distributed.DistributedSampler(val_data) was used but dist.all_reduce() was not, the mIoU increased during testing. Why is that?
  4. Finally, 'train.py' uses the intersectionAndUnionGPU function from 'util.py', while 'test.py' uses the evaluate function from 'iou.py'. What is the essential difference between these two evaluation metrics in practice?
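To make the all_reduce question concrete, here is a minimal single-process sketch (the function name and shard data are illustrative, not the repo's actual helpers) of why per-rank intersection/union histograms must be summed, which is exactly what dist.all_reduce with the default SUM op does, before computing a global mIoU:

```python
import numpy as np

def intersection_and_union(pred, target, num_classes):
    # Per-class intersection and union histograms, in the style of a
    # typical intersectionAndUnionGPU helper (names here are illustrative).
    intersection = pred[pred == target]
    area_inter = np.histogram(intersection, bins=num_classes, range=(0, num_classes))[0]
    area_pred = np.histogram(pred, bins=num_classes, range=(0, num_classes))[0]
    area_target = np.histogram(target, bins=num_classes, range=(0, num_classes))[0]
    return area_inter, area_pred + area_target - area_inter

# Simulate two DDP ranks, each seeing a disjoint shard of the validation
# set (what DistributedSampler provides). Summing the per-rank histograms
# below stands in for dist.all_reduce(intersection) / dist.all_reduce(union).
shards = [
    (np.array([0, 0, 1, 1]), np.array([0, 1, 1, 1])),  # "rank 0"
    (np.array([0, 1, 0, 1]), np.array([0, 0, 0, 1])),  # "rank 1"
]
total_inter = np.zeros(2, dtype=np.int64)
total_union = np.zeros(2, dtype=np.int64)
for pred, target in shards:
    inter, union = intersection_and_union(pred, target, 2)
    total_inter += inter
    total_union += union

iou = total_inter / total_union
print("global mIoU:", iou.mean())  # computed over all ranks' data
```

Without the all_reduce, each rank would report an mIoU over only its own shard, which is why the numbers can shift when the reduction is dropped.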

I look forward to hearing from you and thank you again for your excellent work.

chaolongy · May 04 '22 02:05