
[QUESTION] Why do we need both "train_valid_test_datasets_provider.is_distributed = True" and batched data broadcasting?

rayleizhu opened this issue 1 year ago

I noticed that when train_valid_test_datasets_provider.is_distributed = True, the data loader is created in all processes, regardless of their tensor parallel rank.

https://github.com/NVIDIA/Megatron-LM/blob/c02b335b6318ada8c6a38c95ce3c754da2a579f9/pretrain_vlm.py#L333

https://github.com/NVIDIA/Megatron-LM/blob/c02b335b6318ada8c6a38c95ce3c754da2a579f9/megatron/training/training.py#L1685
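For context, my reading of that dataloader-building path is roughly the following (a simplified sketch, not the actual source; make_loaders is just a placeholder helper):

```python
# Simplified sketch of the dataloader-building logic in training.py
# (not the exact Megatron-LM source; make_loaders is a placeholder).
from megatron.core import parallel_state as mpu

def build_dataloaders_sketch(train_valid_test_datasets_provider, num_samples):
    is_distributed = getattr(
        train_valid_test_datasets_provider, "is_distributed", False
    )
    if is_distributed or mpu.get_tensor_model_parallel_rank() == 0:
        # With is_distributed = True, every process builds its own datasets
        # and loaders; otherwise only TP rank 0 does.
        train_ds, valid_ds, test_ds = train_valid_test_datasets_provider(num_samples)
        return make_loaders(train_ds, valid_ds, test_ds)  # placeholder helper
    # In the rank-0-only path, non-zero TP ranks get no dataloader.
    return None, None, None
```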

However, in get_batch(), the batched data is still broadcast from TP rank 0:

https://github.com/NVIDIA/Megatron-LM/blob/c02b335b6318ada8c6a38c95ce3c754da2a579f9/pretrain_vlm.py#L242
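The broadcast I mean follows the usual Megatron get_batch pattern, roughly like this (simplified; the key names are just examples, the real VLM batch has more fields and dtypes):

```python
# Minimal sketch of the broadcast step in get_batch(): TP rank 0 reads from
# the data iterator, then tensor_parallel.broadcast_data ships the tensors
# to the other ranks in the same tensor-parallel group.
import torch
from megatron.core import parallel_state as mpu
from megatron.core import tensor_parallel

def get_batch_sketch(data_iterator, keys=("tokens", "labels")):
    if mpu.get_tensor_model_parallel_rank() == 0:
        data = next(data_iterator)  # only the TP source rank touches the loader
    else:
        data = None                 # other TP ranks receive via broadcast
    data_b = tensor_parallel.broadcast_data(list(keys), data, torch.int64)
    return data_b["tokens"], data_b["labels"]
```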

I'm confused about why we need both of them. My understanding is that we need either distributed data loading on every rank or broadcasting from TP rank 0, but not both.

rayleizhu · Oct 04 '24