Olatunji Ruwase
@wenting-zhao, for more context [here](https://github.com/microsoft/DeepSpeedExamples/blob/25d73cf73fb3dc66faefa141b7319526555be9fc/bing_bert/nvidia_bert_dataset_provider.py#L164-L167) is where a different hdf5 file is selected (using global rank) for each GPU.
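In rough terms, the selection logic works like this (a simplified sketch, not the actual `nvidia_bert_dataset_provider.py` code; the round-robin scheme and `epoch` offset are assumptions for illustration):

```python
# Hedged sketch: each rank picks a different HDF5 shard based on its
# global rank, assuming a simple round-robin assignment over sorted files.
def select_shard(files, global_rank, world_size, epoch=0):
    files = sorted(files)
    # Rank r reads files[r], then rotates by world_size each epoch.
    idx = (global_rank + epoch * world_size) % len(files)
    return files[idx]

files = [f"shard_{i:03d}.hdf5" for i in range(8)]
print(select_shard(files, global_rank=2, world_size=4))  # shard_002.hdf5
```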
Yes, all the concerns you raise about this approach are valid. If I remember correctly, NVIDIA used this data pipeline to achieve this [BERT training record](https://developer.nvidia.com/blog/training-bert-with-gpus/) with >= 1024...
> Ok, sounds good. Just out of curiosity, I was wondering if the nvidia dataset option was extensively tested with multiple GPUs? I also got the issue mentioned in [microsoft/DeepSpeed#1054](https://github.com/microsoft/DeepSpeed/issues/1054)....
That is great progress. Now we know the cause of the hang. So the next step is to figure out why all the processes don't finish their samples in the...
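To make the failure mode concrete (a hypothetical illustration, not output from the actual run): if the per-rank shards hold different sample counts, the rank with the smallest shard exhausts its loader first while the others block in the next collective, which looks like a hang. A quick sanity check is to compare the counts up front:

```python
# Hypothetical per-rank sample counts; in a real job you would allgather
# each rank's len(dataset) via torch.distributed before training.
counts = {0: 1000, 1: 1000, 2: 998, 3: 1000}
min_count = min(counts.values())
if len(set(counts.values())) > 1:
    # Capping every rank at the minimum keeps all ranks in lockstep.
    print(f"Uneven shards; cap every rank at {min_count} samples per epoch")
```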
@wenting-zhao, I just wanted to check if you are still working on this? Thanks.
@1024er, apologies for the delay. Will take a closer look asap.
@wenting-zhao, thanks for the response. No rush, please. I was just checking in case there was anything I could help with. Thanks so much for working on this important feature.
@1024er, apologies I have not had much time to explore this.
@kiehls90, from your stack trace it seems the failure occurs during allgather and is hence NCCL-related. My guess is that the memory allocation required by allgather is failing because of...
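As a back-of-envelope check (the tensor size and world size below are made-up numbers, and fp16 is an assumption): allgather materializes every rank's shard on every rank, so the output buffer alone needs `world_size * numel * dtype_bytes`.

```python
# Hedged estimate of the allgather output-buffer size.
def allgather_output_bytes(numel, world_size, dtype_bytes=2):  # fp16 assumed
    # Each rank receives world_size shards of numel elements each.
    return world_size * numel * dtype_bytes

# e.g. gathering a 1B-element fp16 shard across 64 ranks:
gib = allgather_output_bytes(1_000_000_000, 64) / 2**30
print(f"{gib:.1f} GiB per rank just for the gathered buffer")
```

If that figure exceeds free GPU memory, the allocation inside the collective fails even though the model itself fits.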
@drcege, how did you measure the 60% GPU memory usage for `MBS=12`? You could estimate the expected memory usage by extrapolating from measurements at a smaller `MBS`.
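The extrapolation idea can be sketched as follows (the GB figures are hypothetical, and linear scaling of activation memory with `MBS` is an assumption): model states are roughly constant in `MBS`, while activation memory grows about linearly, so two measurements pin down both terms.

```python
# Hedged sketch: fit mem(MBS) = fixed + per_sample * MBS from two runs.
def extrapolate_mem(mbs_a, mem_a, mbs_b, mem_b, mbs_target):
    per_sample = (mem_b - mem_a) / (mbs_b - mbs_a)  # activation cost per sample
    fixed = mem_a - per_sample * mbs_a              # model states + overhead
    return fixed + per_sample * mbs_target

# hypothetical measurements: 18 GB at MBS=4, 26 GB at MBS=8 -> predict MBS=12
print(extrapolate_mem(4, 18, 8, 26, 12))  # 34.0
```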