Olatunji Ruwase

@wenting-zhao, for more context, [here](https://github.com/microsoft/DeepSpeedExamples/blob/25d73cf73fb3dc66faefa141b7319526555be9fc/bing_bert/nvidia_bert_dataset_provider.py#L164-L167) is where a different HDF5 file is selected (using the global rank) for each GPU.
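
In case it helps, the idea looks roughly like the following sketch (the names here are illustrative, not the exact code at that link):

```python
import glob
import torch.distributed as dist

def select_shard(data_dir: str, f_start_id: int) -> str:
    """Pick a distinct HDF5 shard per GPU by striding the sorted file list
    with the global rank, so no two ranks read the same file in one round."""
    files = sorted(glob.glob(f"{data_dir}/*.hdf5"))
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    # Each round (f_start_id) advances the stride; the modulo wraps around
    # when there are fewer files than world_size * rounds.
    return files[(f_start_id * world_size + rank) % len(files)]
```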

Yes, all the concerns you raise about this approach are valid. If I remember correctly, Nvidia used this data pipeline to achieve this [BERT training record](https://developer.nvidia.com/blog/training-bert-with-gpus/) with >= 1024...

> Ok, sounds good. Just out of curiosity, I was wondering if the nvidia dataset option was extensively tested with multiple GPUs? I also got the issue mentioned in [microsoft/DeepSpeed#1054](https://github.com/microsoft/DeepSpeed/issues/1054)....

That is great progress. Now we know the cause of the hang. So the next step is to figure out why all the processes don't finish their samples in the...
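
In the meantime, a quick way to confirm that ranks are finishing different numbers of samples is a diagnostic like the one below (a minimal sketch assuming a standard `torch.distributed` setup; `local_num_batches` stands in for whatever your dataloader reports). A rank with fewer batches stops issuing collectives while the others block inside allreduce waiting for it, which is exactly this kind of hang:

```python
import torch
import torch.distributed as dist

def check_sample_balance(local_num_batches: int):
    """Gather each rank's batch count and flag any mismatch, the usual
    cause of a hang in the data-parallel training loop."""
    counts = torch.tensor([local_num_batches], device="cuda")
    gathered = [torch.zeros_like(counts) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, counts)
    if dist.get_rank() == 0:
        per_rank = [int(t.item()) for t in gathered]
        if len(set(per_rank)) > 1:
            print(f"Uneven batches per rank: {per_rank} -- training will hang")
```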

@wenting-zhao, I just wanted to check whether you are still working on this. Thanks.

@1024er, apologies for the delay. I will take a closer look ASAP.

@wenting-zhao, thanks for the response. No rush, please. I was just checking in case there was anything I could help with. Thanks so much for working on this important feature.

@1024er, apologies, I have not had much time to explore this.

@kiehls90, from your stack trace it seems the failure is occurring during allgather and hence is NCCL-related. My guess is that the memory allocation required by allgather is failing because of...
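
As a sanity check, you can estimate how large the allgather output buffer has to be: every rank must hold `world_size` copies of its shard, so even a modest per-rank shard multiplies quickly (the numbers below are purely illustrative):

```python
def allgather_output_bytes(shard_numel: int, world_size: int,
                           bytes_per_elem: int = 2) -> float:
    """Rough size in GiB of the buffer an allgather must allocate:
    each rank receives world_size copies of the shard."""
    return shard_numel * world_size * bytes_per_elem / 2**30

# e.g. a 1B-element fp16 shard gathered across 16 ranks needs ~30 GiB,
# which alone can exceed a single GPU's memory.
print(allgather_output_bytes(1_000_000_000, 16))
```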

@drcege, how did you measure the 60% GPU memory usage for `MBS=12`? You could estimate the expected memory usage by extrapolating from a smaller `MBS`.
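
For example, something along these lines would give an estimate (a rough sketch; `run_step` stands in for your own single-training-step function). Since peak memory is roughly a fixed cost (weights, optimizer states) plus a per-sample activation cost times `MBS`, two measurements let you extrapolate:

```python
import torch

def measure_peak_gib(run_step, mbs: int) -> float:
    """Run one training step at the given micro-batch size and return the
    peak allocated GPU memory in GiB."""
    torch.cuda.reset_peak_memory_stats()
    run_step(mbs)
    return torch.cuda.max_memory_allocated() / 2**30

m4 = measure_peak_gib(run_step, 4)
m8 = measure_peak_gib(run_step, 8)
per_sample = (m8 - m4) / 4          # marginal cost per extra sample
print(f"predicted MBS=12 peak: {m8 + 4 * per_sample:.1f} GiB")
```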