[QUESTION] Performance Impact of Using item() in `total_num_tokens += num_tokens.item()` in megatron/core/pipeline_parallel/schedules.py
Hi Megatron-LM team!
While reading megatron/core/pipeline_parallel/schedules.py, I noticed that the line `total_num_tokens += num_tokens.item()` runs between each forward and backward pass and calls `item()` every time.
https://github.com/NVIDIA/Megatron-LM/blob/8ca9e57f9d0bb93fc61850ebdccb6b6e6fa36b64/megatron/core/pipeline_parallel/schedules.py#L451-L467
From my understanding, `item()` copies the value from the GPU to the host, which makes the CPU block until the GPU has finished the computation that produces it. This could hurt performance, as illustrated below.
To validate this, I removed the `item()` call and observed that the time cost associated with this operation was completely eliminated.
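For reference, here is a minimal sketch of the two accumulation patterns being compared. It is not the Megatron-LM code and not necessarily what any fix looks like; the function names and the sample token counts are hypothetical, and it assumes `num_tokens` is a 0-dim integer tensor. It falls back to CPU when no GPU is present, where `item()` has no synchronization cost.

```python
import torch


def accumulate_with_item(token_counts):
    """Baseline: calls .item() once per microbatch (a host sync on CUDA tensors)."""
    total_num_tokens = 0
    for num_tokens in token_counts:
        # On a CUDA tensor this blocks the host until the value is available.
        total_num_tokens += num_tokens.item()
    return total_num_tokens


def accumulate_on_device(token_counts, device):
    """Alternative: keep the running total as a tensor on the same device."""
    total_num_tokens = torch.zeros((), dtype=torch.int64, device=device)
    for num_tokens in token_counts:
        # Tensor addition is queued on the device; no .item() call inside the loop.
        total_num_tokens += num_tokens
    return total_num_tokens


# Hypothetical per-microbatch token counts, for illustration only.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
token_counts = [
    torch.tensor(n, dtype=torch.int64, device=device) for n in (128, 256, 512)
]

total = accumulate_on_device(token_counts, device)
# Read the value on the host once, after the loop, if a Python int is needed.
print(total.item())
```

With the second pattern, the host only has to wait for the GPU at the single point where the total is actually read.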
Could you clarify why item() is used here?
Thanks for your time and insights!
Hi wan-nan, thanks for looking into it. This is being addressed in an internal MR.
This issue is fixed by the following commit: https://github.com/NVIDIA/Megatron-LM/commit/87d9d2506acefaf3bd617b27ebbd24c7ddfcea5c