Copy data from GPU to CPU only once in batched_hyps_to_hypotheses for each tensor of interest.
Previously, several small GPU->CPU copies were issued per call, adding latency that grew linearly with the batch size. For small tensors, a single bulk memory copy is much more efficient than many individual copies.
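The pattern is illustrated below in a minimal, hypothetical sketch (the function names are illustrative, not NeMo's actual API): rather than calling `.cpu()` once per batch element, copy the whole tensor once and do the per-element indexing on the CPU side.

```python
import torch

def slow_gather(scores: torch.Tensor) -> list:
    # One GPU->CPU transfer per batch element: each .cpu() call
    # synchronizes, so latency grows linearly with batch size.
    return [scores[i].cpu().item() for i in range(scores.shape[0])]

def fast_gather(scores: torch.Tensor) -> list:
    # Single bulk copy, then cheap CPU-side indexing.
    scores_cpu = scores.cpu()
    return [scores_cpu[i].item() for i in range(scores_cpu.shape[0])]

device = "cuda" if torch.cuda.is_available() else "cpu"
scores = torch.arange(8, dtype=torch.float32, device=device)
assert slow_gather(scores) == fast_gather(scores)
```

Both functions return identical values; the difference is only in the number of device-to-host transfers issued.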
One issue remains: pack_hypotheses() appears to copy the dec_state tensor from GPU to CPU one element at a time. I would like to eliminate that as well, but it is an intrusive change that affects interfaces, so I am deferring it for now.
RTFx on LibriSpeech test-other:
- before this change: 1766.65
- after this change: 1795.35