Fixed dataset bug in bing_bert.

Open wenting-zhao opened this issue 4 years ago • 17 comments

This PR fixes the issue reported in https://github.com/microsoft/DeepSpeed/issues/1243.

wenting-zhao avatar Jul 21 '21 18:07 wenting-zhao

@wenting-zhao, thanks for looking into this. Can you describe what difference you see with this fix? Is the training loss curve or the throughput improved?

tjruwase avatar Jul 21 '21 20:07 tjruwase

@wenting-zhao, for more context, our port of the nvidia dataset was based on this nvidia bert code, which used RandomSampler because of how their dataset was organized. I have not tracked changes to their dataset that might necessitate your fix, so I would appreciate your response. Thanks!

tjruwase avatar Jul 21 '21 20:07 tjruwase

Hi @tjruwase,

Let's say you have n datapoints in an hdf5 file. With RandomSampler, when you iterate over the dataloader created from that hdf5 file, the length of the iterator (i.e., enumerate(tqdm(dataset_iterator, smoothing=1))) is n/batch_size. No matter how many GPUs you add, the length of the iterator stays the same. In other words, the time to iterate over the dataset is the same regardless of how many GPUs you use.

However, with DistributedSampler, the length of the iterator (enumerate(tqdm(dataset_iterator, smoothing=1))) becomes n/batch_size/num_of_GPUs. I assume this is the correct behavior? This way, it actually takes less time to iterate over the dataset: the more GPUs you use, the less time one pass over the data takes.
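
For illustration, here is a minimal sketch of the difference, using a toy in-memory dataset in place of the hdf5 shard (the numbers and names are illustrative, not the bing_bert code):

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy stand-in for one hdf5 shard with n samples.
n, batch_size = 708875, 64
dataset = TensorDataset(torch.zeros(n, 1))

# RandomSampler: every rank iterates over all n samples,
# so len(loader) == ceil(n / batch_size) no matter how many GPUs run.
loader_rs = DataLoader(dataset, sampler=RandomSampler(dataset), batch_size=batch_size)

# DistributedSampler: each rank only sees its 1/world_size slice,
# so len(loader) == ceil(ceil(n / world_size) / batch_size).
# (num_replicas/rank passed explicitly so this runs without init_process_group.)
sampler_ds = DistributedSampler(dataset, num_replicas=8, rank=0)
loader_ds = DataLoader(dataset, sampler=sampler_ds, batch_size=batch_size)

print(len(loader_rs), len(loader_ds))  # 11077 vs 1385 with 8 ranks
```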

Let me know if this is unclear.

wenting-zhao avatar Jul 23 '21 16:07 wenting-zhao

@wenting-zhao, thanks for your explanation. Your description is correct, but it only applies to the case where multiple GPUs are processing one hdf5 file. However, when we did this port, the nvidia dataset was sharded into multiple hdf5 files, and each file was processed completely by exactly one GPU, as you can see here and here in the baseline Nvidia code.

I have not kept up to date with this dataset so I don't know if the entire dataset is now a single hdf5 file. Is that what you observed?

tjruwase avatar Jul 23 '21 17:07 tjruwase

@wenting-zhao, for more context, here is where a different hdf5 file is selected (using global rank) for each GPU.
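
The selection is roughly of this shape (an illustrative sketch of rank-based shard assignment, not a copy of the repo code; data_dir and the epoch-rotation detail are assumptions):

```python
import glob
import os

def select_shard(data_dir, epoch, global_rank, world_size):
    """Illustrative: each rank picks its own hdf5 shard from the directory,
    so a given file is consumed by exactly one GPU in a given epoch."""
    files = sorted(glob.glob(os.path.join(data_dir, "*.hdf5")))
    # Rotate the assignment each epoch so ranks eventually see different shards.
    shard_idx = (epoch * world_size + global_rank) % len(files)
    return files[shard_idx]
```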

tjruwase avatar Jul 23 '21 17:07 tjruwase

Ok, I see. That makes sense. The printed message is a bit confusing, though. I get: `turing.logger - worker-0: begin epoch 1 current_sample_count 0 shard_length 708875 global_data_samples 0`

So all the processes loaded the same hdf5 file, because I only have one hdf5 file, and it has length 708875.

Are there specific reasons for making each process run on a different data shard, though? It seems counter-intuitive: the number of hdf5 files may not be divisible by the number of GPUs, and each hdf5 file could have a different length. At the very least, it would be helpful to make this a choice for users. Another issue is that it takes too long before a checkpoint gets saved.

wenting-zhao avatar Jul 23 '21 18:07 wenting-zhao

Yes, all the concerns that you raise with this approach are valid. If I remember correctly, Nvidia used this data pipeline to achieve this Bert training record with >= 1024 GPUs back in 2019. In that case, it would be quite inefficient for each file to be processed by all the GPUs. We simply replicated that data pipeline in order to reproduce their results with DeepSpeed.

However, I agree that this pipeline might not be useful for most users who have at most 10s of GPUs. So I am happy to switch to all GPUs processing the same file using DistributedSampler. However, there are still a couple of things required for this PR to achieve that.

  1. Modify the code here so that each GPU reads all files, not just the one corresponding to its global rank (a rough sketch is shown below).
  2. E2E run with sequence 128 and 512 that demonstrates the baseline convergence is achieved.

I realize (2) is a non-trivial task, but it is required to ensure that nothing is broken.
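
For (1), the change would look roughly like this (a sketch assuming a `load_shard(path)` helper that returns a map-style Dataset per hdf5 file; names are illustrative, not the repo's actual code):

```python
import glob
import os
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def shard_loaders(data_dir, global_rank, world_size, batch_size, load_shard):
    """Illustrative: every rank walks every shard, and a DistributedSampler
    splits each shard across ranks instead of assigning whole files to ranks."""
    for path in sorted(glob.glob(os.path.join(data_dir, "*.hdf5"))):
        dataset = load_shard(path)
        sampler = DistributedSampler(
            dataset, num_replicas=world_size, rank=global_rank, shuffle=True
        )
        yield DataLoader(dataset, sampler=sampler, batch_size=batch_size)
```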

tjruwase avatar Jul 23 '21 19:07 tjruwase

Ok, sounds good. Just out of curiosity, I was wondering whether the nvidia dataset option has been extensively tested with multiple GPUs? I also hit the issue mentioned in https://github.com/microsoft/DeepSpeed/issues/1054. I would appreciate any insights into solving this problem! Both RandomSampler and DistributedSampler have this problem.

wenting-zhao avatar Jul 23 '21 20:07 wenting-zhao

Ok, I fixed the save issue with DistributedSampler. I will put together a fix and run E2E, in case there are other people like me who need to pretrain models using this code.

wenting-zhao avatar Jul 23 '21 20:07 wenting-zhao

> Ok, sounds good. Just out of curiosity, I was wondering whether the nvidia dataset option has been extensively tested with multiple GPUs? I also hit the issue mentioned in microsoft/DeepSpeed#1054. I would appreciate any insights into solving this problem! Both RandomSampler and DistributedSampler have this problem.

I suspect that this problem might be related to the issue you are now trying to fix. Can you try this: in a 2-GPU case, replace the save_checkpoint() call with a log message indicating progress, then check whether all the ranks are behaving the same way. You can also do this for any number of GPUs; 2 is just the simplest case.
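
Something like this hypothetical debug patch (inside the training loop, in place of the checkpoint call) would make it obvious which ranks get there:

```python
import torch.distributed as dist

# Hypothetical debug patch: log progress instead of checkpointing, so a rank
# that never reaches this point is immediately visible in the output.
rank = dist.get_rank()
print(f"rank {rank}: finished the data file, would checkpoint here", flush=True)
# save_checkpoint(...)  # temporarily disabled while investigating the hang
```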

tjruwase avatar Jul 23 '21 21:07 tjruwase

The problem is that not all processes reach save_checkpoint(). I inserted a print statement at https://github.com/microsoft/DeepSpeedExamples/blob/25d73cf73fb3dc66faefa141b7319526555be9fc/bing_bert/deepspeed_train.py#L220 , and only one process reaches it. :)

wenting-zhao avatar Jul 23 '21 22:07 wenting-zhao

That is great progress. Now we know the cause of the hang. So the next step is to figure out why the processes don't all finish their samples in the data file at the same time. It is possible they are stuck waiting for more data. You might find the command-line flag --max_steps_per_epoch useful for this investigation. As you can see here, here, and here, it allows the processes to exit the training loop early without consuming all the samples in the data file (or epoch). Hope that helps.
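
The shape of that early exit is roughly the following (an illustrative sketch, not the exact repo code):

```python
def run_epoch(dataset_iterator, train_step, max_steps_per_epoch=None):
    """Illustrative epoch loop: honor an optional step budget so every rank
    leaves the epoch at the same step count instead of draining the whole shard."""
    for step, batch in enumerate(dataset_iterator):
        train_step(batch)
        if max_steps_per_epoch is not None and step + 1 >= max_steps_per_epoch:
            break  # early exit before consuming all samples in the data file
```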

tjruwase avatar Jul 23 '21 23:07 tjruwase

@wenting-zhao, I just wanted to check if you are still working on this? Thanks.

tjruwase avatar Aug 23 '21 15:08 tjruwase

Yes, sorry, I am still working on this. I will try to put together a commit asap.

wenting-zhao avatar Aug 23 '21 21:08 wenting-zhao

@wenting-zhao, thanks for the response. No rush, please. I was just checking in case there was anything I could help with. Thanks so much for working on this important feature.

tjruwase avatar Aug 23 '21 21:08 tjruwase

Hi, I have put together a commit for people who want to use DeepSpeed to do multi-node pretraining. The code is tested both with the wiki dataset (though I didn't have the resources to finish pretraining) and with my own dataset (which is at roughly the same scale as the wiki dataset, for five epochs). Three main modifications:

  1. Made it work with DistributedSampler: all workers read the same data shard and will not hang with multiple nodes.
  2. When loading a previously trained model, reload the optimizer and learning-rate scheduler states (see the sketch below).
  3. Added explicit garbage collection to manage memory better and avoid OOM.

I do realize the code is not perfect yet. One issue I am aware of: to train on a dataset for one epoch, you currently need to set the number of epochs in the config file to the number of data shards. Unfortunately, for the time being I don't have the resources to add and test that part of the code.
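
For (2) and (3), the intent is roughly the following (a sketch using DeepSpeed's engine checkpoint API plus explicit garbage collection; model_engine, checkpoint_dir, and tag are placeholders):

```python
import gc
import torch

# (2) Resume from a previous checkpoint, restoring optimizer and LR-scheduler
#     state along with the model weights (model_engine comes from deepspeed.initialize).
load_path, client_state = model_engine.load_checkpoint(
    checkpoint_dir, tag, load_optimizer_states=True, load_lr_scheduler_states=True
)

# (3) Between shards/epochs, release Python objects and cached CUDA blocks
#     to keep memory from creeping up and avoid OOM.
gc.collect()
torch.cuda.empty_cache()
```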

Let me know if I can be of further help or if you have any questions.

wenting-zhao avatar Aug 30 '21 04:08 wenting-zhao