Megatron-LM
[QUESTION] Why is the time of one iteration in nsys longer than that in the output log?
I want to compare the training speed of llama2-7b between libai (https://github.com/Oneflow-Inc/libai) and Megatron-LM on an NVIDIA A800-SXM4-80G. But I find that the time of one iteration in nsys is longer than the time reported in the output log when using Megatron-LM:
- the log time is:
iteration 200/ 1000 | consumed samples: 200 | elapsed time per iteration (ms): 183.7 | learning rate: 9.375E-06 | global batch size: 1 | lm loss: 7.889984E+00 | loss scale: 1.0 | grad norm: 4.921 | number of skipped iterations: 0 | number of nan iterations: 0 |
- the nsys time is: [nsys timeline screenshot]
Also, I can't find many gaps in the CUDA stream, so the extra time doesn't look like GPU idle time.
Can anyone explain this to me?
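
In case it helps: here is a minimal sketch of how the per-iteration time could be cross-checked outside of nsys, assuming a generic PyTorch training loop (`model`, `batch`, and `optimizer` are placeholders, not Megatron-LM internals). The NVTX range makes the iteration boundaries visible in the nsys timeline when profiling with `-t cuda,nvtx`, and the explicit `torch.cuda.synchronize()` calls ensure the wall-clock measurement covers the full GPU work of the step:

```python
import time
import torch

def timed_step(model, batch, optimizer):
    # Mark the iteration as a named range so it can be located
    # in the nsys timeline (requires profiling with -t cuda,nvtx).
    torch.cuda.nvtx.range_push("train_iter")

    torch.cuda.synchronize()   # drain any pending GPU work before timing
    start = time.perf_counter()

    loss = model(batch)        # placeholder forward + loss; adapt to the real loop
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    torch.cuda.synchronize()   # wait for all GPU work of this step to finish
    elapsed_ms = (time.perf_counter() - start) * 1000.0

    torch.cuda.nvtx.range_pop()
    return elapsed_ms
```

Since nsys capture itself adds some overhead, and (as far as I can tell) the logged `elapsed time per iteration` is an average over the logging interval, comparing this direct measurement with and without nsys attached should show whether the gap comes from the profiler or from the averaging.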