Megatron-LM
[QUESTION] Why is the time of one iteration in nsys longer than that in the output log?
I want to compare the training speed of llama2-7b between libai (https://github.com/Oneflow-Inc/libai) and Megatron-LM on an NVIDIA A800-SXM4-80G. But I find that the time of one iteration in nsys is longer than the time reported in the output log when using Megatron-LM:
- the log time is:
iteration 200/ 1000 | consumed samples: 200 | elapsed time per iteration (ms): 183.7 | learning rate: 9.375E-06 | global batch size: 1 | lm loss: 7.889984E+00 | loss scale: 1.0 | grad norm: 4.921 | number of skipped iterations: 0 | number of nan iterations: 0 |
- the nsys time is: [nsys timeline screenshot]
Also, I can't find many gaps in the CUDA stream, so the extra time doesn't look like GPU idle time.
Can anyone explain this to me?
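
In case it helps: here is a minimal sketch of how the per-iteration time could be cross-checked outside of nsys, assuming a generic PyTorch training loop (`model`, `batch`, and `optimizer` are placeholders, not Megatron-LM internals). The NVTX range makes the iteration boundaries visible in the nsys timeline when profiling with `-t cuda,nvtx`, and the explicit `torch.cuda.synchronize()` calls ensure the wall-clock measurement covers the full GPU work of the step:

```python
import time
import torch

def timed_step(model, batch, optimizer):
    # Mark the iteration as a named range so it can be located
    # in the nsys timeline (requires profiling with -t cuda,nvtx).
    torch.cuda.nvtx.range_push("train_iter")

    torch.cuda.synchronize()   # drain any pending GPU work before timing
    start = time.perf_counter()

    loss = model(batch)        # placeholder forward + loss; adapt to the real loop
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    torch.cuda.synchronize()   # wait for all GPU work of this step to finish
    elapsed_ms = (time.perf_counter() - start) * 1000.0

    torch.cuda.nvtx.range_pop()
    return elapsed_ms
```

Since nsys capture itself adds some overhead, and (as far as I can tell) the logged `elapsed time per iteration` is an average over the logging interval, comparing this direct measurement with and without nsys attached should show whether the gap comes from the profiler or from the averaging.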