
[magpietts][lhotse_v2] Add scripts for converting NeMo manifests to Lhotse Shars, plus speedup improvements for codec model inference.

Open XuesongYang opened this issue 6 months ago • 9 comments

Speed Up Codec Model Inference

Made slight changes to the dependency util func mask_sequence_tensor. We observed at least a 3x speedup. This change is cherry-picked from part of PR https://github.com/NVIDIA/NeMo/pull/12617.
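The kind of vectorization behind this speedup can be illustrated with a small NumPy sketch (this is not the NeMo implementation, which operates on torch tensors; it only shows the broadcasting idea of replacing a per-item Python loop with one batched comparison):

```python
import numpy as np

def mask_sequence_tensor(lengths: np.ndarray, max_len: int) -> np.ndarray:
    """Build a boolean padding mask of shape (batch, max_len) with a single
    broadcasted comparison: position j is valid for item i iff j < lengths[i]."""
    # positions: (1, max_len); lengths: (batch, 1) -> broadcasts to (batch, max_len)
    return np.arange(max_len)[None, :] < lengths[:, None]

mask = mask_sequence_tensor(np.array([2, 4]), 5)
```

Building the whole mask in one vectorized comparison, rather than filling rows one sequence at a time, is where loop-bound implementations typically lose their time.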

Lhotse Shar v2

Three scripts were added to preprocess NeMo manifests and convert them into Lhotse Shars (cuts/audio/codes). The new version significantly reduces computation overhead by balancing workloads across ranks and letting each rank write independently. The process involves three main steps.

  1. Extend Manifests with Context Audio (on GPU nodes): Enhance the NeMo manifests by adding context audio information.

    • The old recipe saved individual speaker embedding files to disk and loaded them again to compute speaker similarity, which does not scale well.
    • The new recipe extracts speaker embeddings on the fly and applies torch.matmul to compute the similarity matrix. It recursively finds the next-best context audio when the 1-best candidate does not survive, which preserves more data records.
    • It scales up without inter-process communication cost: a distinct subset of speaker records is pre-allocated to each GPU rank, balancing rank workloads with a greedy bin-packing strategy. Round-robin was tried but was not ideal for this task.
    • Manifest writes are buffered and flushed only when the buffer is full.
  2. Create Lhotse Shards (on CPU nodes): Convert the extended NeMo manifests into Lhotse shards.

    • Processes a chunk of manifest entries, loads audio, and writes corresponding single shard files for cuts, target audio, and context audio.
    • Designed to be run in a parallel worker process.
    • Loads and writes audio iteratively to save memory.
  3. Extend Shards with Audio Codes (on GPU nodes): Process the Lhotse shards to extract and include audio codes (audio codec extraction).

    • Pre-allocates Lhotse shards to each rank; each rank processes and writes independently.
    • Pre-allocates padded audio tensors instead of running a for-loop over torch.nn.functional.pad.
    • Avoids the redundant zero-padding (up to 4x) observed in the old recipe, where audio was first padded to a multiple of the samples per frame, then padded to the max audio length, and both paddings were applied again inside codec inference.
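The greedy bin-packing allocation mentioned in step 1 can be sketched as follows (a minimal stand-alone sketch; the function and variable names are hypothetical, not taken from the actual scripts). It is the classic longest-processing-time heuristic: sort items by descending cost, then always assign to the currently lightest rank.

```python
import heapq

def greedy_bin_pack(workloads: dict, num_ranks: int) -> dict:
    """Assign each named workload (e.g. a speaker's total audio duration) to
    the least-loaded rank. A min-heap keyed by accumulated load makes each
    assignment O(log num_ranks)."""
    heap = [(0.0, rank, []) for rank in range(num_ranks)]
    heapq.heapify(heap)
    # Largest items first: placing big items early lets small ones fill gaps.
    for name, cost in sorted(workloads.items(), key=lambda kv: -kv[1]):
        load, rank, items = heapq.heappop(heap)  # lightest rank so far
        items.append(name)
        heapq.heappush(heap, (load + cost, rank, items))
    return {rank: items for _, rank, items in heap}

assignment = greedy_bin_pack({"a": 10, "b": 9, "c": 4, "d": 3, "e": 2}, num_ranks=2)
```

This also suggests why round-robin underperforms here: when per-speaker durations are heavily skewed, round-robin can hand one rank several of the largest speakers in a row, while the greedy heuristic keeps the per-rank totals close.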
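The chunked parallel-worker pattern in step 2 can be sketched like this (the loader and writer are injected as callables because the real implementations, which read audio and emit Lhotse cut/audio shard files, are not shown here; all names are hypothetical):

```python
from typing import Callable, Iterator, Sequence

def iter_chunks(entries: Sequence[dict], chunk_size: int) -> Iterator[Sequence[dict]]:
    """Yield fixed-size chunks of manifest entries; each chunk maps to one shard."""
    for start in range(0, len(entries), chunk_size):
        yield entries[start:start + chunk_size]

def process_chunk(chunk: Sequence[dict],
                  load_audio: Callable[[dict], bytes],
                  write_item: Callable[[dict, bytes], None]) -> int:
    """Worker body (sketch): load each entry's audio and write it immediately,
    so only one audio array is held in memory at a time. Designed to run as an
    independent parallel worker, one chunk per worker."""
    count = 0
    for entry in chunk:
        audio = load_audio(entry)
        write_item(entry, audio)
        count += 1
    return count
```

Because each worker touches a disjoint chunk and writes its own shard files, no coordination between workers is needed, which is what makes the CPU-node stage embarrassingly parallel.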
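The single-pass padding in step 3 can be illustrated in NumPy (the actual scripts operate on torch tensors on GPU; the function name is hypothetical). The point is to allocate one zeroed batch tensor whose width already accounts for both padding requirements, so no further padding happens inside codec inference:

```python
import numpy as np

def batch_pad_once(audios, samples_per_frame: int):
    """Pad a list of 1-D waveforms exactly once: round the max length up to a
    multiple of samples_per_frame, allocate one zeroed batch array, and copy
    each waveform into its row. Returns the batch and the original lengths."""
    max_len = max(a.shape[0] for a in audios)
    # ceil-divide, then scale back up: the codec's framing needs no extra pad
    padded_len = -(-max_len // samples_per_frame) * samples_per_frame
    batch = np.zeros((len(audios), padded_len), dtype=np.float32)
    lengths = np.zeros(len(audios), dtype=np.int64)
    for i, a in enumerate(audios):
        batch[i, : a.shape[0]] = a
        lengths[i] = a.shape[0]
    return batch, lengths
```

Compared with calling a pad function per waveform and then padding again to the batch max (and again inside the model), this touches each zero region exactly once.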

Note: (for internal users) each Python script is wrapped as a Slurm job sub file that can process multiple datasets at once. Refer to this link for details.

XuesongYang avatar Jun 04 '25 00:06 XuesongYang

Part of the code has been reviewed by @pzelasko offline. Piotr, could you please review the scripts again if you have time?

XuesongYang avatar Jun 04 '25 17:06 XuesongYang

Can we close #13548?

blisc avatar Jun 04 '25 20:06 blisc

Can we close #13548?

Yes, we can close that. It is no longer needed.

paarthneekhara avatar Jun 04 '25 21:06 paarthneekhara

@XuesongYang please merge main into this branch again to resolve some issues with getting the CI to run

chtruong814 avatar Jun 04 '25 23:06 chtruong814

Looks good to me. We can add the above PR description to a README.

paarthneekhara avatar Jun 05 '25 07:06 paarthneekhara

This is a heavy lift, nice work, @XuesongYang !

rfejgin avatar Jun 05 '25 21:06 rfejgin

@XuesongYang please merge main into this branch again to resolve some issues with getting the CI to run

Sorry, I didn't catch what you suggested. Our code branch magpietts_2503 is far behind the latest main branch; rebasing onto the latest main would be risky for our ongoing experiments.

XuesongYang avatar Jun 07 '25 01:06 XuesongYang

Looks good to me. We can add the above PR description to a README.

Yeah, will do.

XuesongYang avatar Jun 07 '25 01:06 XuesongYang

Looks good to me. We can add the above PR description to a README.

@paarthneekhara I just added the README file. Please have a look.

XuesongYang avatar Jun 11 '25 00:06 XuesongYang