Unable to reproduce bing_bert with nvidia data

Open 1024er opened this issue 3 years ago • 11 comments

I used the default settings and ran the code on 32 V100 GPUs; the data was constructed with NVIDIA's scripts. I was able to reproduce NVIDIA's results on SQuAD (F1=90), but failed to reproduce the DeepSpeed results.

stage 1, maxlen 128 (epoch 0-16): [image]

stage 2, maxlen 512 (epoch 17-20): [image]

I evaluated the epoch 20 checkpoint on BingBertSquad, and the results were unexpectedly low: F1=21.

Thank you.

1024er avatar Jan 08 '22 05:01 1024er

@jeffra @tjruwase

1024er avatar Feb 08 '22 15:02 1024er

@1024er, apologies for the delay. Will take a closer look asap.

tjruwase avatar Feb 09 '22 20:02 tjruwase

Thank you so much ~ Is there any progress please? :)

1024er avatar Feb 19 '22 11:02 1024er

@1024er, apologies I have not had much time to explore this.

tjruwase avatar Feb 19 '22 13:02 tjruwase

Hi, do you have time to take a look soon? Thank you @tjruwase

1024er avatar Mar 18 '22 09:03 1024er

Hi @1024er, have you ever encountered the problem where DeepSpeed stalls at the end of the first epoch (please refer to this issue)?

If so, would you be willing to share your solution? If not, could you please tell me the DeepSpeed version and the NVIDIA Docker version you used in your experiments?

haolin-nju avatar Apr 25 '22 12:04 haolin-nju

This is due to an inconsistent amount of data on each GPU. The problem can be solved by processing the same amount of data per shard.
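As a rough illustration of "the same amount of data per shard", here is a minimal sketch in plain Python. The function names `rebalance` and `truncate_to_min` are hypothetical and are not part of DeepSpeed or the NVIDIA scripts; the sketch only shows the idea of giving every rank an identical sample count so that no rank finishes the epoch early and leaves the others waiting in a collective operation.

```python
from typing import List, Sequence


def rebalance(samples: Sequence, num_shards: int) -> List[list]:
    """Split samples into num_shards shards of identical length.

    Hypothetical helper: the trailing remainder that does not divide
    evenly is dropped so that every rank sees the same sample count.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    per_shard = len(samples) // num_shards
    usable = list(samples[: per_shard * num_shards])
    # Round-robin assignment: shard i takes items i, i + num_shards, ...
    return [usable[i::num_shards] for i in range(num_shards)]


def truncate_to_min(shards: Sequence[Sequence]) -> List[list]:
    """Cut already-built shards down to the length of the shortest one."""
    if not shards:
        return []
    shortest = min(len(shard) for shard in shards)
    return [list(shard[:shortest]) for shard in shards]
```

`rebalance` fits the case where the shards are being rebuilt from one pool of samples, while `truncate_to_min` fits the case where the shard files already exist and only need to be trimmed. Both discard a small amount of data, which is an assumption about what is acceptable here.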

1024er avatar May 15 '22 17:05 1024er

Thank you for your reply. I'll give it a try and share my results asap.

haolin-nju avatar May 16 '22 02:05 haolin-nju

Have you encountered the problem with the very low F1 score mentioned above?

1024er avatar Jun 11 '22 07:06 1024er

Hi @1024er, I'd suggest double-checking whether the checkpoints are loaded successfully by inspecting the return values of the model.load_state_dict calls in https://github.com/microsoft/DeepSpeedExamples/blob/master/BingBertSquad/nvidia_run_squad_deepspeed.py#L828

There should be 2~4 missing keys depending on whether you use LM or not. If you see more than that, that's the problem.
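A minimal sketch of such a check follows. It assumes the call is made with `strict=False`, in which case PyTorch's `load_state_dict` returns an object with `missing_keys` and `unexpected_keys` fields. The helper `check_load_result` and the `LoadResult` stand-in are hypothetical names introduced only for illustration, and the default threshold of 4 is taken from the "2~4 missing keys" estimate above.

```python
from collections import namedtuple

# Stand-in with the same two fields as the object returned by
# torch.nn.Module.load_state_dict(..., strict=False). It is used here
# only so the sketch runs without PyTorch installed.
LoadResult = namedtuple("LoadResult", ["missing_keys", "unexpected_keys"])


def check_load_result(result, max_missing=4):
    """Return (ok, missing, unexpected) for a load_state_dict result.

    Hypothetical helper: ok is False when more keys are missing than
    max_missing, which suggests the checkpoint did not really load.
    """
    missing = list(result.missing_keys)
    unexpected = list(result.unexpected_keys)
    ok = len(missing) <= max_missing
    return ok, missing, unexpected


# With a real model the call would look roughly like this:
#   result = model.load_state_dict(state_dict, strict=False)
#   ok, missing, unexpected = check_load_result(result)
#   print(ok, missing, unexpected)
```

Printing the missing and unexpected key names, not just their counts, tends to make a prefix mismatch (for example an extra `module.` in front of every key) easy to spot.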

HeyangQin avatar Jun 14 '22 22:06 HeyangQin

Hi @1024er, sorry for the late reply. I tried your method over the last few weeks. However, DeepSpeed still has some other bugs, such as the optimizer loss becoming NaN in fp16 mode.

I'm no longer working on this framework because I couldn't reproduce DeepSpeed's BERT results after almost two months. However, I have also tried Megatron-LM (which is also one of the code sources of DeepSpeed). For your reference, the latest version of the Megatron-LM code can't reproduce the results on the SQuAD dataset either (its accuracy only reaches about 60% with BERT-base).

Good luck!

haolin-nju avatar Jun 16 '22 15:06 haolin-nju