Copy data from GPU to CPU only once in batched_hyps_to_hypotheses for each tensor of interest.
Previously, several small GPU->CPU copies were issued per call, adding latency that grew linearly with the batch size. For small tensors, a single bulk memory copy is much more efficient than many individual copies.
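The pattern is illustrated below in a minimal, hypothetical sketch (the function names are illustrative, not NeMo's actual API): rather than calling `.cpu()` once per batch element, copy the whole tensor once and do the per-element indexing on the CPU side.

```python
import torch

def slow_gather(scores: torch.Tensor) -> list:
    # One GPU->CPU transfer per batch element: each .cpu() call
    # synchronizes, so latency grows linearly with batch size.
    return [scores[i].cpu().item() for i in range(scores.shape[0])]

def fast_gather(scores: torch.Tensor) -> list:
    # Single bulk copy, then cheap CPU-side indexing.
    scores_cpu = scores.cpu()
    return [scores_cpu[i].item() for i in range(scores_cpu.shape[0])]

device = "cuda" if torch.cuda.is_available() else "cpu"
scores = torch.arange(8, dtype=torch.float32, device=device)
assert slow_gather(scores) == fast_gather(scores)
```

Both functions return identical values; the difference is only in the number of device-to-host transfers issued.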
One issue remains: pack_hypotheses() appears to copy the dec_state tensor from GPU to CPU one element at a time. I would like to eliminate that as well, but it is an intrusive change that affects interfaces, so I am deferring it for now.
RTFx on LibriSpeech test-other:
- before this change: 1766.65
- after this change: 1795.35