What is necessary in the training script?

Open ArnaudFickinger opened this issue 1 year ago • 2 comments

I noticed that coordinator.block_all(), torch.set_num_threads(1) and dist.barrier() were added to the training script. Were they added for debugging purposes only, or are they also useful for training?

ArnaudFickinger, Jun 21 '24 23:06

Actually, they are useful when you train the model on a large-scale distributed system. We put them at specific points in the script to make distributed training more stable.

If you are training on a small scale, or pre-training on a very robust distributed system, you can try removing them. That said, these statements introduce negligible overhead.
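For readers unfamiliar with these calls, here is a minimal sketch of what two of them do in plain PyTorch. It is not the Open-Sora training script: the helper name sync_if_distributed is hypothetical, and the placement of the calls is only illustrative. torch.set_num_threads(1) limits each process to one intra-op CPU thread, which can avoid CPU oversubscription when several worker processes share a node. dist.barrier() makes every rank wait until all ranks have reached the same point. coordinator.block_all() comes from ColossalAI and, as far as I know, plays the same role as a barrier; it is left out here so the sketch depends only on PyTorch.

```python
import torch
import torch.distributed as dist


def sync_if_distributed() -> bool:
    """Wait for all ranks, but only if a process group is actually running.

    Hypothetical helper for illustration. Returns True if a barrier was
    executed, False if the script runs as a single, non-distributed process.
    """
    if dist.is_available() and dist.is_initialized():
        dist.barrier()  # every rank blocks here until all ranks arrive
        return True
    return False


# One intra-op CPU thread per process, so that N processes on one node
# do not each spawn a full thread pool.
torch.set_num_threads(1)

# Typical places for a barrier are after setup steps that not all ranks
# finish at the same time (for example dataset or checkpoint loading).
ran_barrier = sync_if_distributed()
```

Launched as a single process without init_process_group, the helper skips the barrier, which is why removing these calls is usually harmless for small runs.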

zhengzangw, Jun 22 '24 09:06

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot], Jun 30 '24 01:06

#549 fixes this problem. dist.barrier() is now disabled by default, and you can set record_time=True to restore the previous behaviour.
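As a rough illustration of gating a barrier behind a flag, the sketch below times a section of code and synchronizes only when record_time is enabled. It is not the code from #549. The helper timed_section, its barrier parameter, and the idea that the barrier exists to make per-step timings comparable across ranks are all assumptions made for this example. In a real script the barrier argument would typically be torch.distributed.barrier; a plain function stands in for it here so the example runs without a process group.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator, Optional


@contextmanager
def timed_section(
    name: str,
    log: Dict[str, float],
    record_time: bool = False,
    barrier: Optional[Callable[[], None]] = None,
) -> Iterator[None]:
    """Hypothetical helper: time a block, synchronizing only if record_time is set."""
    if not record_time:
        # Default path: no barrier and no timing, so no synchronization cost.
        yield
        return
    if barrier is not None:
        barrier()  # line up all ranks before starting the clock
    start = time.perf_counter()
    try:
        yield
    finally:
        if barrier is not None:
            barrier()  # wait for the slowest rank before stopping the clock
        log[name] = time.perf_counter() - start


# Stand-in for torch.distributed.barrier that just counts its calls.
barrier_calls = []


def fake_barrier() -> None:
    barrier_calls.append(1)


timings: Dict[str, float] = {}

with timed_section("forward", timings, record_time=False, barrier=fake_barrier):
    pass  # barrier is never called on the default path

with timed_section("forward", timings, record_time=True, barrier=fake_barrier):
    pass  # barrier is called once on entry and once on exit
```

Keeping the barrier off by default means ranks are not forced to wait for each other at every step, while opting in still gives timings measured between synchronized points.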

zhengzangw, Jul 08 '24 07:07