[QUESTION] Why does the timer barrier take so long to sync up between ranks?
We ran into an issue while testing Megatron-LM with a 6B model on 1K GPUs. Looking at the output of each iteration, we found that the gap between the min and max values of params-all-gather within a single iteration is very large. When this happens, the TFLOPs is low.
After reading the code, we believe this issue is caused by the barrier in the timers, as shown below (distrib_optimizer.py:step()).
timers('params-all-gather', log_level=1).start(barrier=args.barrier_with_L1_time)
self._reset_metadata_and_sync_gather_all_model_params(force_sync=False)
timers('params-all-gather').stop()
To debug this issue, we added one line that prints a timestamp right after the timer's start call returns:
timers('params-all-gather', log_level=1).start(barrier=args.barrier_with_L1_time)
print("go out of the barrier of timer: ", time.time())
self._reset_metadata_and_sync_gather_all_model_params(force_sync=False)
timers('params-all-gather').stop()
Comparing the timestamps printed by different ranks, we found that the maximum difference between ranks is 200+ ms, which seems far too large. Any clue as to what causes this large time difference would be highly appreciated. Many thanks in advance.
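For reference, the comparison of timestamps across ranks can be sketched roughly as follows. This is only a minimal illustration, not the exact script we used: the function name `barrier_exit_skew` and the `[rank N]` prefix on each log line are assumptions made for the example (the print statement above does not emit a rank tag on its own).

```python
import re

# Hypothetical log format, one line per rank for a single iteration, e.g.
#   "[rank 3] go out of the barrier of timer:  1700000000.123"
# The "[rank N]" prefix is an assumption; the original print has no rank tag.
LINE_RE = re.compile(
    r"\[rank (\d+)\] go out of the barrier of timer:\s+([0-9.]+)"
)


def barrier_exit_skew(lines):
    """Return (skew_ms, earliest_rank, latest_rank) for one iteration's lines."""
    stamps = {}
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            stamps[int(match.group(1))] = float(match.group(2))
    if not stamps:
        raise ValueError("no timestamp lines found")
    # Rank that left the barrier first and rank that left it last.
    earliest = min(stamps, key=stamps.get)
    latest = max(stamps, key=stamps.get)
    skew_ms = (stamps[latest] - stamps[earliest]) * 1000.0
    return skew_ms, earliest, latest
```

One caveat with this kind of measurement: time.time() is a wall-clock reading, so timestamps from different hosts are only comparable to the extent that the host clocks are synchronized. Part of the observed gap could therefore be clock offset between nodes rather than real barrier skew.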
Marking as stale. No activity in 60 days.