
[QUESTION] tensor_parallel.broadcast_data and train_valid_test_datasets_provider.is_distributed = True

[Open] KookHoiKim opened this issue 5 months ago · 0 comments

In my understanding, the pretrain code broadcasts the data from TP rank 0 to the other TP-rank GPUs.
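
For context, here is a minimal sketch of what that broadcast step looks like, assuming an initialized `torch.distributed` process group. The names `broadcast_batch`, `tp_group`, and `tp_src_rank` are placeholders for illustration, not Megatron-LM's exact API (the real `tensor_parallel.broadcast_data` also broadcasts tensor sizes first so receivers can allocate buffers):

```python
import torch
import torch.distributed as dist

def broadcast_batch(batch: dict, tp_group, tp_src_rank: int) -> dict:
    """Broadcast a dict of CUDA tensors from the TP source rank.

    On the source rank, `batch` holds the real data; on the other TP
    ranks it only needs tensors of matching shape/dtype to receive into.
    (Hypothetical sketch, not Megatron-LM's implementation.)
    """
    for key, tensor in batch.items():
        # dist.broadcast copies `tensor` from `tp_src_rank` to every
        # rank in `tp_group`, in place.
        dist.broadcast(tensor, src=tp_src_rank, group=tp_group)
    return batch
```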

However, if I enable the option train_valid_test_datasets_provider.is_distributed = True while building the dataloader, the dataloader is initialized on every GPU, and the GPUs appear to return the same data on every iteration. What does tensor_parallel.broadcast_data do in that case?
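
To make the contrast concrete, here is a hypothetical sketch of the two build paths being compared; `is_distributed`, `tp_rank`, and `build_dataloader` are placeholder names under my assumptions, not Megatron-LM's exact API:

```python
from typing import Iterable, Optional

def build_dataloader() -> Iterable:
    """Placeholder for the real dataset/dataloader construction."""
    ...

def get_dataloader(is_distributed: bool, tp_rank: int) -> Optional[Iterable]:
    if is_distributed:
        # Every rank builds its own dataloader; with the same seed and
        # sharding, each TP rank already yields an identical batch,
        # which seems to make the later broadcast redundant.
        return build_dataloader()
    # Otherwise only TP rank 0 builds one; the other TP ranks receive
    # each batch via tensor_parallel.broadcast_data instead.
    return build_dataloader() if tp_rank == 0 else None
```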

I am not sure I have understood the data-broadcasting procedure correctly, so I would be very grateful for any information about this. Thanks.

KookHoiKim · Sep 09 '24 11:09