What is necessary in the training script?
I noticed that coordinator.block_all(), torch.set_num_threads(1) and dist.barrier() were added to the training script. Were they added for debugging purposes only, or are they useful for training?
They are useful when you train the model on a large-scale distributed system; we placed them at the appropriate points to make distributed training more stable.
If you are training at a small scale, or pre-training on a very robust distributed system, you can try removing them. But note that these calls only introduce negligible overhead.
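For context, here is a minimal sketch of where such calls typically sit in a distributed training step. The loop structure and function names are illustrative assumptions, not the project's actual script; `torch.set_num_threads` and `dist.barrier` are the real PyTorch APIs discussed above, and `coordinator.block_all()` is assumed to be a thin wrapper around a barrier.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def train_epoch(model, dataloader, optimizer):
    # Limit intra-op CPU threads so each rank does not oversubscribe the
    # cores shared by all local processes (host-side stability).
    torch.set_num_threads(1)

    for step, (inputs, targets) in enumerate(dataloader):
        # Synchronize all ranks before the compute-heavy region so a
        # straggler does not skew timing or trigger collective timeouts.
        # coordinator.block_all() plays a similar role in the original script.
        dist.barrier()

        outputs = model(inputs)
        loss = F.cross_entropy(outputs, targets)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```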
#549 fixes this problem. By default, dist.barrier() is now disabled; you can set record_time=True to restore the previous behaviour.
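A rough sketch of the kind of change described, assuming `record_time` simply gates the barrier (and timing) around a training step; the function name and signature are hypothetical:

```python
import time
import torch.distributed as dist

def train_step(model, batch, optimizer, record_time: bool = False):
    # When record_time is enabled, synchronize all ranks first so the
    # measured step time is not distorted by stragglers; otherwise skip
    # the barrier entirely to avoid its overhead (the new default).
    if record_time and dist.is_initialized():
        dist.barrier()
        start = time.time()

    loss = model(**batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if record_time and dist.is_initialized():
        dist.barrier()
        return loss, time.time() - start
    return loss, None
```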