DeepSpeedExamples
unable to reproduce bing_bert results with nvidia data
I used the default settings and ran the code on 32 V100s; the data was constructed with NVIDIA's scripts. I was able to reproduce NVIDIA's results on SQuAD (F1 = 90), but failed to reproduce the DeepSpeed results.
- stage 1: max seq length 128 (epochs 0-16)
- stage 2: max seq length 512 (epochs 17-20)

I evaluated the epoch-20 checkpoint on BingBertSquad, and the result was unexpectedly low: F1 = 21.
Thank you.
@jeffra @tjruwase
@1024er, apologies for the delay. Will take a closer look asap.
Thank you so much! Is there any progress? :)
@1024er, apologies I have not had much time to explore this.
Hi, do you have time to take a look at this soon? Thank you @tjruwase
Hi @1024er, have you ever encountered the problem where DeepSpeed stalls at the end of the first epoch (please refer to this issue)?
If so, would you mind sharing your solution? If not, could you please tell me the DeepSpeed version and the NVIDIA Docker version you used in your experiments?
This is due to an inconsistent amount of data on each GPU. The problem can be solved by making every shard contain the same amount of data (see the sketch below).
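A minimal sketch of that workaround, assuming torch.distributed is already initialized (NCCL backend, one GPU per rank) and each rank holds its own shard as a map-style dataset; the helper name `trim_to_global_min` is hypothetical, not something from DeepSpeedExamples:

```python
import torch
import torch.distributed as dist
from torch.utils.data import Subset

def trim_to_global_min(dataset):
    """Truncate this rank's shard to the smallest shard size across all ranks,
    so every GPU runs the same number of steps and no collective op stalls."""
    local_len = torch.tensor([len(dataset)], dtype=torch.long, device="cuda")
    dist.all_reduce(local_len, op=dist.ReduceOp.MIN)  # global minimum shard size
    min_len = int(local_len.item())
    # Keep only the first min_len samples; padding or a custom sampler would
    # also work, but truncation is the simplest way to equalize the shards.
    return Subset(dataset, range(min_len))
```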
Thank you for your reply. I'll give it a try and share my results asap.
Have you encountered the very low F1 problem mentioned above?
Hi @1024er, I'd suggest double-checking whether the checkpoints are loaded successfully by inspecting the return values of the model.load_state_dict calls in https://github.com/microsoft/DeepSpeedExamples/blob/master/BingBertSquad/nvidia_run_squad_deepspeed.py#L828
There should be 2~4 missing keys, depending on whether you use the LM head or not. If you see more than that, that's the problem.
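For reference, a rough sketch of that check, assuming `model` and `checkpoint_state_dict` are the objects already constructed in nvidia_run_squad_deepspeed.py; with strict=False, PyTorch returns a (missing_keys, unexpected_keys) pair instead of raising, so the result can be inspected directly:

```python
# Illustrative names: see what load_state_dict reports about the checkpoint.
missing_keys, unexpected_keys = model.load_state_dict(
    checkpoint_state_dict, strict=False
)
print(f"missing keys ({len(missing_keys)}): {missing_keys}")
print(f"unexpected keys ({len(unexpected_keys)}): {unexpected_keys}")

# Per the comment above, roughly 2~4 missing keys (the task/LM head) are
# expected; a long list usually means the checkpoint did not actually load.
if len(missing_keys) > 4:
    print("WARNING: checkpoint may not have loaded correctly")
```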
Hi @1024er, sorry for the late reply. I've tried your method over the last few weeks. However, DeepSpeed still has other bugs for me, such as the loss becoming NaN in fp16 mode.
I'm no longer working with this framework because I haven't been able to reproduce DeepSpeed's BERT results after almost two months. However, from my experience with Megatron-LM (which is also part of DeepSpeed's code base), for your reference, the latest version of the Megatron-LM code can't reproduce the results on the SQuAD dataset either (its accuracy only reaches about 60% with BERT-base).
Good luck!