# [magpietts][lhotse_v2] Adding scripts for converting NeMo manifests to Lhotse Shars, and speedup improvements for codec model inference
## Speed Up Codec Model Inference
Made slight changes to the dependency utility function `mask_sequence_tensor`. We observed at least a 3x speedup. This change is cherry-picked from part of PR https://github.com/NVIDIA/NeMo/pull/12617.
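The exact diff lives in the linked PR; as a rough illustration only, this kind of speedup typically comes from building the length mask with a single broadcasted comparison instead of looping over the batch. A minimal sketch, assuming the input is shaped `(batch, ..., time)` and `lengths` is `(batch,)`:

```python
import torch

def mask_sequence_tensor(tensor: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
    """Zero out positions past each sequence's length (illustrative sketch, not the actual diff)."""
    max_len = tensor.shape[-1]
    # (1, max_len) < (batch, 1) -> (batch, max_len) boolean mask, built without a Python loop
    mask = torch.arange(max_len, device=tensor.device)[None, :] < lengths[:, None]
    # Add singleton dims so the mask broadcasts over any channel/feature axes
    while mask.dim() < tensor.dim():
        mask = mask.unsqueeze(1)
    return tensor * mask.to(tensor.dtype)
```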
## Lhotse Shar v2
Three scripts were added to preprocess NeMo manifests and convert them into Lhotse Shars (cuts/audio/codes). This new version significantly reduces computation overhead by balancing workloads across ranks and letting each rank write its outputs independently. The process involves three main steps.
- **Extend Manifests with Context Audio (on GPU nodes):** Enhance the NeMo manifests by adding context audio information.
  - The old recipe saved individual speaker embedding files and loaded them again to compute speaker similarity, which does not scale well computationally.
  - The new recipe runs speaker embedding extraction on the fly and applies `torch.matmul` to compute the similarity matrix. It recursively finds the next-best context audio whenever the 1-best candidate does not survive, preserving more data records (see the similarity sketch after this list).
  - Computation scales up without inter-process communication (IPC) cost: a distinct subset of speaker records is pre-allocated to each GPU rank using a greedy bin-packing strategy to balance rank workloads (see the bin-packing sketch after this list). Round-robin was tried but is not ideal for this task.
  - Manifest entries are written in a buffered fashion, flushing only when the buffer is full.
- **Create Lhotse Shards (on CPU nodes):** Convert the extended NeMo manifests into Lhotse shards.
  - Processes a chunk of manifest entries, loads audio, and writes the corresponding single shard files for cuts, target audio, and context audio.
  - Designed to be run in a parallel worker process.
  - Loads and writes audio iteratively to save memory.
- **Extend Shards with Audio Codes (on GPU nodes):** Process the Lhotse shards to extract and include audio codes (audio codec extraction).
  - Pre-allocates Lhotse shards to each rank, and each rank processes and writes independently.
  - Pre-allocates padded audio tensors instead of running a for-loop of `torch.nn.functional.pad` calls (see the padding sketch after this list).
  - Avoids the redundant zero-padding observed in the old recipe, which effectively padded 4x: audio was first padded to a multiple of samples-per-frame, then padded to the max audio length, and both paddings were applied again inside the codec inference.
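For the context-audio selection in the first step, a minimal sketch of the on-the-fly similarity computation and the next-best fallback is given below. The function name, the `survives` predicate, and the tensor shapes are assumptions for illustration, not the script's actual API.

```python
import torch
import torch.nn.functional as F

def pick_context_audio(target_emb: torch.Tensor, candidate_embs: torch.Tensor, survives) -> list:
    """Pick a context audio per target utterance (illustrative sketch).

    target_emb:     (num_targets, dim) speaker embeddings of target utterances
    candidate_embs: (num_candidates, dim) embeddings of candidate context audios
    survives(t, c): stand-in for whatever filtering the real recipe applies
                    (e.g. duration or same-utterance checks)
    Returns the chosen candidate index per target, or None if nothing survives.
    """
    # Cosine similarity matrix via a single matmul on normalized embeddings
    sim = torch.matmul(F.normalize(target_emb, dim=-1),
                       F.normalize(candidate_embs, dim=-1).T)  # (T, C)
    choices = []
    for t, row in enumerate(sim):
        chosen = None
        # Walk candidates from best to worst until one survives the checks
        for c in torch.argsort(row, descending=True).tolist():
            if survives(t, c):
                chosen = c
                break
        choices.append(chosen)  # None means the record is dropped
    return choices
```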
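The rank pre-allocation in the first step is a classic load-balancing problem; below is a hedged sketch of a greedy (largest-first) bin-packing assignment of speakers to ranks. The function and field names are illustrative, not the actual script's.

```python
import heapq

def assign_speakers_to_ranks(speaker_durations: dict, num_ranks: int) -> dict:
    """Greedy bin-packing: place each speaker on the currently least-loaded rank.

    speaker_durations: {speaker_id: total_audio_seconds} (illustrative input)
    Returns {rank_id: [speaker_id, ...]} with roughly balanced total durations.
    """
    # Min-heap keyed by the accumulated duration per rank
    heap = [(0.0, rank) for rank in range(num_ranks)]
    heapq.heapify(heap)
    assignments = {rank: [] for rank in range(num_ranks)}
    # Largest speakers first, so big items are placed while bins are still empty
    for spk, dur in sorted(speaker_durations.items(), key=lambda kv: -kv[1]):
        load, rank = heapq.heappop(heap)
        assignments[rank].append(spk)
        heapq.heappush(heap, (load + dur, rank))
    return assignments
```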
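For the third step, the padding optimization can be pictured as copying each waveform once into a pre-allocated batch tensor whose length is already rounded up to a whole number of codec frames, so no further padding is needed inside codec inference. A minimal sketch under those assumptions (the helper name is hypothetical):

```python
import torch

def batch_audio(waveforms: list, samples_per_frame: int):
    """Pack variable-length 1-D waveforms into one padded batch tensor (illustrative sketch)."""
    lengths = torch.tensor([w.shape[-1] for w in waveforms])
    max_len = int(lengths.max())
    # Round the padded length up to a multiple of the codec frame size
    padded_len = ((max_len + samples_per_frame - 1) // samples_per_frame) * samples_per_frame
    batch = torch.zeros(len(waveforms), padded_len, dtype=waveforms[0].dtype)
    for i, w in enumerate(waveforms):
        batch[i, : w.shape[-1]] = w  # single copy per item, no torch.nn.functional.pad calls
    return batch, lengths
```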
Note (for internal users): each Python script is wrapped in a Slurm job submission file that can process multiple datasets at once. Refer to this link for details.
Part of the code has been reviewed by @pzelasko offline. Piotr, could you please review the scripts again if you have time?
Can we close #13548?
> Can we close #13548?
Yes, we can close that. It is no longer needed.
@XuesongYang please merge main into this branch again to resolve some issues with getting the CI to run
Looks good to me. We can add the above PR description to a README.
This is a heavy lift, nice work, @XuesongYang!
> @XuesongYang please merge main into this branch again to resolve some issues with getting the CI to run

Sorry, I didn't quite catch what you suggested. Our code branch magpietts_2503 is far behind the latest main branch; rebasing onto the latest main would be risky for our experiments for now.
> Looks good to me. We can add the above PR description to a README.
Yeah, will do.
> Looks good to me. We can add the above PR description to a README.
@paarthneekhara I just added the README file. Please have a look.