Megatron-LM [QUESTION] Performance Impact of Using item() in `total_num_tokens += num_tokens.item()` in megatron/core/pipeline_parallel/schedules.py

Hi Megatron-LM team!

While going through the code in megatron/core/pipeline_parallel/schedules.py, I noticed that the line `total_num_tokens += num_tokens.item()` runs between each forward and backward pass and calls the item() method.

https://github.com/NVIDIA/Megatron-LM/blob/8ca9e57f9d0bb93fc61850ebdccb6b6e6fa36b64/megatron/core/pipeline_parallel/schedules.py#L451-L467

From my understanding, the item() method copies the value from the GPU device to the host, which forces the CPU to block until the GPU has finished the computation that produces it. This could have a negative impact on performance.
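
To make the concern concrete, here is a minimal sketch of how the host-side wait could be measured. It is not taken from Megatron-LM: the function name `host_wait_for_item` and the matmul workload are assumptions for illustration only. On a CUDA device the matmuls are queued asynchronously, so the time spent inside item() includes waiting for them to finish. On a CPU-only machine the script still runs but shows no such wait.

```python
import time

import torch


def host_wait_for_item(size: int = 512, repeats: int = 10):
    """Queue some work, then time how long the host spends inside .item().

    Hypothetical helper for illustration; not part of Megatron-LM.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(size, size, device=device)

    # On CUDA these kernels are launched asynchronously: the Python loop
    # returns before the GPU has finished computing them.
    y = x
    for _ in range(repeats):
        y = torch.tanh(y @ x)

    # A 0-dim integer tensor standing in for `num_tokens`.
    num_tokens = (y > 0).sum()

    start = time.perf_counter()
    value = num_tokens.item()  # host blocks here until the result is ready
    elapsed = time.perf_counter() - start
    return value, elapsed


value, elapsed = host_wait_for_item()
print(f"num_tokens={value}, host time inside item(): {elapsed * 1e3:.3f} ms")
```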

To validate this, I removed the item() method and observed that the time cost associated with this operation was completely eliminated.
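
One way to keep the token count without synchronizing on every microbatch is to accumulate on the device and read the value once at the end. The sketch below illustrates that general pattern only; it is not the Megatron-LM implementation, and the function names and sample counts are hypothetical.

```python
import torch


def count_tokens_sync_each_step(per_microbatch_counts):
    """Accumulate into a Python int: one host sync per microbatch."""
    total = 0
    for num_tokens in per_microbatch_counts:
        total += num_tokens.item()  # blocks the host on every iteration
    return total


def count_tokens_sync_once(per_microbatch_counts, device):
    """Accumulate into a device tensor: a single host sync at the end."""
    total = torch.zeros((), dtype=torch.int64, device=device)
    for num_tokens in per_microbatch_counts:
        total += num_tokens  # stays on the device, no host sync
    return total.item()  # the only blocking call


device = "cuda" if torch.cuda.is_available() else "cpu"
# Made-up per-microbatch token counts, as 0-dim tensors on the device.
counts = [torch.tensor(n, device=device) for n in (128, 256, 512)]

assert count_tokens_sync_each_step(counts) == count_tokens_sync_once(counts, device)
```

Both functions return the same total. The difference is only in how often the host has to wait for the device.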

Could you clarify why item() is used here?

Thanks for your time and insights!

Feb 13 '25 06:02 wan-nan

Hi wan-nan, thanks for looking into it. This is being addressed in an internal MR.

Mar 25 '25 06:03 shifangx

This issue is fixed by the following commit: https://github.com/NVIDIA/Megatron-LM/commit/87d9d2506acefaf3bd617b27ebbd24c7ddfcea5c

May 30 '25 04:05 shifangx
