Olatunji Ruwase
@wenting-zhao, for more context [here](https://github.com/microsoft/DeepSpeedExamples/blob/25d73cf73fb3dc66faefa141b7319526555be9fc/bing_bert/nvidia_bert_dataset_provider.py#L164-L167) is where a different hdf5 file is selected (using global rank) for each GPU.
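In rough terms, the selection logic works like this (a simplified sketch, not the actual `nvidia_bert_dataset_provider.py` code; the round-robin scheme and `epoch` offset are assumptions for illustration):

```python
# Hedged sketch: each rank picks a different HDF5 shard based on its
# global rank, assuming a simple round-robin assignment over sorted files.
def select_shard(files, global_rank, world_size, epoch=0):
    files = sorted(files)
    # Rank r reads files[r], then rotates by world_size each epoch.
    idx = (global_rank + epoch * world_size) % len(files)
    return files[idx]

files = [f"shard_{i:03d}.hdf5" for i in range(8)]
print(select_shard(files, global_rank=2, world_size=4))  # shard_002.hdf5
```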
Yes, all the concerns you raise about this approach are valid. If I remember correctly, NVIDIA used this data pipeline to achieve this [BERT training record](https://developer.nvidia.com/blog/training-bert-with-gpus/) with >= 1024...
> Ok, sounds good. Just out of curiosity, I was wondering if the nvidia dataset option was extensively tested with multiple GPUs? I also got the issue mentioned in [microsoft/DeepSpeed#1054](https://github.com/microsoft/DeepSpeed/issues/1054)....
That is great progress. Now we know the cause of the hang. So the next step is to figure out why all the processes don't finish their samples in the...
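To make the failure mode concrete (a hypothetical illustration, not output from the actual run): if the per-rank shards hold different sample counts, the rank with the smallest shard exhausts its loader first while the others block in the next collective, which looks like a hang. A quick sanity check is to compare the counts up front:

```python
# Hypothetical per-rank sample counts; in a real job you would allgather
# each rank's len(dataset) via torch.distributed before training.
counts = {0: 1000, 1: 1000, 2: 998, 3: 1000}
min_count = min(counts.values())
if len(set(counts.values())) > 1:
    # Capping every rank at the minimum keeps all ranks in lockstep.
    print(f"Uneven shards; cap every rank at {min_count} samples per epoch")
```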
@wenting-zhao, I just wanted to check if you are still working on this? Thanks.
@1024er, apologies for the delay. Will take a closer look asap.
@wenting-zhao, thanks for the response. No rush, please. I was just checking in case there was anything I could help with. Thanks so much for working on this important feature.
@1024er, apologies I have not had much time to explore this.
@kiehls90, from your stack trace it seems the failure occurs during allgather and is hence NCCL-related. My guess is that the memory allocation required by allgather is failing because of...
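As a back-of-envelope check (the tensor size and world size below are made-up numbers, and fp16 is an assumption): allgather materializes every rank's shard on every rank, so the output buffer alone needs `world_size * numel * dtype_bytes`.

```python
# Hedged estimate of the allgather output-buffer size.
def allgather_output_bytes(numel, world_size, dtype_bytes=2):  # fp16 assumed
    # Each rank receives world_size shards of numel elements each.
    return world_size * numel * dtype_bytes

# e.g. gathering a 1B-element fp16 shard across 64 ranks:
gib = allgather_output_bytes(1_000_000_000, 64) / 2**30
print(f"{gib:.1f} GiB per rank just for the gathered buffer")
```

If that figure exceeds free GPU memory, the allocation inside the collective fails even though the model itself fits.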
@drcege, how did you measure the 60% GPU memory usage for `MBS=12`? You could estimate the expected memory usage by extrapolating from measurements at a smaller `MBS`.
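The extrapolation idea can be sketched as follows (the GB figures are hypothetical, and linear scaling of activation memory with `MBS` is an assumption): model states are roughly constant in `MBS`, while activation memory grows about linearly, so two measurements pin down both terms.

```python
# Hedged sketch: fit mem(MBS) = fixed + per_sample * MBS from two runs.
def extrapolate_mem(mbs_a, mem_a, mbs_b, mem_b, mbs_target):
    per_sample = (mem_b - mem_a) / (mbs_b - mbs_a)  # activation cost per sample
    fixed = mem_a - per_sample * mbs_a              # model states + overhead
    return fixed + per_sample * mbs_target

# hypothetical measurements: 18 GB at MBS=4, 26 GB at MBS=8 -> predict MBS=12
print(extrapolate_mem(4, 18, 8, 26, 12))  # 34.0
```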