NeMo
Fixed chokepoint in diarization for longer audios
What does this PR do?
Even after the addition of the long-form clustering feature, the code was still failing because `get_argmin_mat` acted as a chokepoint, consuming far too much memory for a mathematical operation that is not especially complex: the tiling function ended up creating matrices of size n². The new function drastically reduces memory consumption with minimal change in running time. Fixes the problem described in OOM Clustering Bug Collection: asr
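The chokepoint and the for-loop fix can be sketched as follows. This is a toy reproduction, not the actual NeMo code: the function names `argmin_full` / `argmin_chunked` and the chunk size are hypothetical, chosen only to illustrate mapping each base-scale anchor to its nearest current-scale anchor without materializing the full n² distance matrix.

```python
import torch

def argmin_full(base_anchor: torch.Tensor, curr_anchor: torch.Tensor) -> torch.Tensor:
    """Tile-style version: builds the full |base| x |curr| distance matrix (O(n^2) memory)."""
    curr_mat = curr_anchor.tile(base_anchor.shape[0], 1)      # (B, C)
    base_mat = base_anchor.tile(curr_anchor.shape[0], 1).t()  # (B, C)
    return torch.argmin(torch.abs(curr_mat - base_mat), dim=1)

def argmin_chunked(base_anchor: torch.Tensor, curr_anchor: torch.Tensor, chunk: int = 256) -> torch.Tensor:
    """For-loop version: processes base anchors in chunks, so peak memory is O(chunk * |curr|)."""
    out = []
    for start in range(0, base_anchor.shape[0], chunk):
        block = base_anchor[start:start + chunk].unsqueeze(1)            # (chunk, 1)
        out.append(torch.argmin(torch.abs(curr_anchor.unsqueeze(0) - block), dim=1))
    return torch.cat(out)

base = torch.linspace(0.0, 100.0, 4000)  # toy base-scale segment anchors
curr = torch.linspace(0.0, 100.0, 700)   # toy coarser-scale segment anchors
assert torch.equal(argmin_full(base, curr), argmin_chunked(base, curr))
```

Both versions compute the same per-row argmin; only the peak memory footprint differs.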
Changelog
- Add specific line by line info of high level changes in this PR.
Usage
- You can potentially add a usage example below
# Add a code snippet demonstrating how to use this
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR. To re-run CI remove and add the label again. To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
- [ ] Make sure you read and followed Contributor guidelines
- [ ] Did you write any new necessary tests?
- [ ] Did you add or update any necessary documentation?
- [ ] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- [ ] Reviewer: Does the PR have correct import guards for all optional libraries?
PR Type:
- [ ] New Feature
- [ ] Bugfix
- [ ] Documentation
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed. The Contributor guidelines contain specific people who can review PRs to various areas.
Additional Information
- Related to # (issue)
I will modify the code soon so that all the test cases pass.
Thank you for suggesting this idea.
I currently don't have the bandwidth to test this, so let me check on it as soon as I can.
This needs to be tested with long-form and short-form audio clustering.
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.
This PR was closed because it has been inactive for 7 days since being marked as stale.
@tango4j @nithinraok Reminder to please review it
Hi, @AatikaNazneen. Can you provide the 4-hour-long audio data that is triggering the OOM, for testing? Also, there is a limit on audio length. Please be aware that NeMo speaker diarization cannot handle arbitrarily long audio: even with the long-form clustering method, the audio length is capped by GPU memory.
I haven't tested this file, but here is an example of a 4-hour podcast that can be used for testing: https://www.youtube.com/watch?v=HoRMD9NtKYc
I used this one: https://www.youtube.com/watch?v=0OhTQnDp1yw. After this change, I was able to diarize it on a SageMaker g4 instance, which has 16 GB of GPU RAM and 16 GB of CPU RAM.
@AatikaNazneen I was not able to find a way to download the wav file from your YouTube link. Could you please extract the wav file and upload it via a Google Drive or Dropbox account? Also, reviewing this PR will take some time, since we need to test other scenarios. Please provide the long-form audio files you can share; that will expedite the process of testing your PR.
@tango4j You can use that audio or any other. The main point is that long-form clustering previously took up more RAM, and that has been optimized, so this can be tested with any long audio; you would, however, have to limit your resources to see the before-and-after effect.
@tango4j Any updates?
@tango4j @nithinraok
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.
@tango4j @nithinraok Reminder to please review this.
@AatikaNazneen Sorry for the frustration. Let me review this today.
Hi, @AatikaNazneen.
I have run the original main version and your for-loop version.
First of all, your analysis that `torch.tile` can take excessive memory is valid. However, we can fix `get_argmin_mat` by using `torch.expand` instead:
```python
In [1]: import torch
In [2]: vec = torch.rand(1000, 1).T.to("cuda:0")  # 556 MB
In [3]: vec_tile = vec.tile(10000000, 1)          # 38704 MB; 38148 MB was added
In [4]: vec_expand = vec.expand(10000000, 1000)   # 0 MB was added
```
Now, if we change the `get_argmin_mat` lines to these:

```python
curr_mat = curr_scale_anchor.expand(base_scale_anchor.shape[0], -1)
base_mat = base_scale_anchor.expand(curr_scale_anchor.shape[0], -1).t()
```
These are the times taken for three 10-minute audio files:
- torch.expand: 0.0266 seconds
- torch.tile: 0.0568 seconds
- for loop: 0.1970 seconds
The bottom line is that we need to modify this to use expand instead of adding a for loop. The for loop dramatically increases the time complexity from O(S) to O(N·S), where S is the number of scales and N is the frame length at the base scale.
Let me push the `torch.expand` version of the code; please check whether the memory issue persists. As the torch example above shows, there should be no memory chokepoint with `torch.expand`.
@AatikaNazneen @tango4j Are more checks waiting to be reviewed?
@tango4j Your modification did not do the trick for me, and I still get an OOM error.
The reason is that `torch.expand` itself doesn't increase memory, but when we do the subtraction operation, it occupies double the memory for both vectors. The for loop reduces memory consumption by breaking the problem into smaller chunks, so the peak memory occupied is much lower than without the for loop.
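This distinction can be checked directly in PyTorch: `expand` returns a stride-0 view (no allocation), but an element-wise operation on it, such as the subtraction inside `get_argmin_mat`, materializes a full-size result tensor. A small CPU sketch with illustrative sizes:

```python
import torch

vec = torch.rand(1000, 1).T    # shape (1, 1000)
exp = vec.expand(50000, 1000)  # zero-copy view over the same 1000 floats
assert exp.stride() == (0, 1)  # stride 0 in dim 0: rows are not stored separately

col = torch.rand(50000, 1)
diff = exp - col               # the subtraction allocates the full 50000 x 1000 result
assert diff.is_contiguous()
assert diff.numel() * diff.element_size() == 50000 * 1000 * 4  # 200 MB of float32
```

So the expand trick defers the allocation rather than eliminating it; the peak shows up at the arithmetic step.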
@tango4j If you're convinced, then please revert it back to the for-loop version, and add the labels needed to run all the remaining checks as well.
The time complexity is theoretically the same, but with a for loop the memory of the previous iteration is freed (hence the increase in time), which significantly reduces the space complexity. Using the expand function, we need 2·n² memory at one time; using the for loop, we touch n·(2n) memory overall, but at any given time we only need 2n, so we don't hit OOM. Let me know if you need further clarification.
@AatikaNazneen I still do not see how the for loop and the expand function have the same complexity. Both in theory and in experiments, the for loop increases the running time: it is vectorized array assignment versus sequential processing, so the for loop obviously takes more time, and when implemented it also shows slower speed.
I missed the part where the expand approach causes memory consumption when it performs subtractions.
So, in short, I think we should create an option to use the for loop to handle extremely long audio. We cannot increase the time complexity by a factor of N just for rarely encountered, extremely long audio files.
@tango4j We can go with that implementation when the long-form clustering pipeline is used.
> I missed the part where the expand approach causes memory consumption when it performs subtractions.

If you take a look at line 13 in the image I shared, the memory increment is twice the increment at line 9. Also, I tested the 5-hour audio on the SageMaker ml.g4dn instance, and it was failing with the expand solution.
@AatikaNazneen I can see that line 13 in your screenshot shows the additional memory usage during the subtraction of the copied tensors. I neglected this part when I was testing; this is why we are not saving much on peak memory usage compared to the "tile" method.
So at the end of the day, I believe this is all a memory-time trade-off. As you suggested, we can let the system use the for loop for long-form audio clustering. That way, we favor memory efficiency only for long-form clustering and reduce the peak memory usage there.
I will add this option shortly to long form audio clustering path.
- standard: expand (speed oriented)
- long form audio: for loop (memory oriented)
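The two-path dispatch above could look like the following sketch. The threshold value, function name, and structure are hypothetical (the actual NeMo change, per a later comment, switches at roughly 10K embedding frames); broadcasting is used here in place of explicit `expand` calls, which is equivalent since PyTorch broadcasting expands views internally.

```python
import torch

LONG_FORM_THRESHOLD = 10_000  # frame count above which the memory-oriented path is used

def pairwise_argmin(base_anchor: torch.Tensor, curr_anchor: torch.Tensor, chunk: int = 1024) -> torch.Tensor:
    """Map each base-scale anchor to the index of its nearest current-scale anchor."""
    if base_anchor.shape[0] < LONG_FORM_THRESHOLD:
        # Standard path: one broadcasted op (speed oriented, O(n^2) peak memory).
        diff = torch.abs(curr_anchor.unsqueeze(0) - base_anchor.unsqueeze(1))
        return torch.argmin(diff, dim=1)
    # Long-form path: chunked for loop (memory oriented, O(chunk * |curr|) peak memory).
    out = []
    for s in range(0, base_anchor.shape[0], chunk):
        block = base_anchor[s:s + chunk].unsqueeze(1)
        out.append(torch.argmin(torch.abs(curr_anchor.unsqueeze(0) - block), dim=1))
    return torch.cat(out)

base_long = torch.linspace(0.0, 3600.0, 20_000)  # toy long-form anchors (takes the loop path)
curr = torch.linspace(0.0, 3600.0, 500)
expected = torch.argmin(torch.abs(curr.unsqueeze(0) - base_long.unsqueeze(1)), dim=1)
assert torch.equal(pairwise_argmin(base_long, curr), expected)
```

The assertion confirms the memory-oriented path returns exactly what the one-shot broadcasted computation would.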
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.
When do you think all of this will be completed? Thanks. @tango4j @nithinraok
@AatikaNazneen @tttalshaul Sorry, I have been traveling overseas for a conference. I will work on this PR starting today and finalize it this week.
reminding 🙏🏿
@AatikaNazneen @tttalshaul I pushed the change to @AatikaNazneen's repo. This version uses expand for fewer than 10K embedding frames, and a for loop for more than 10K embeddings.
@tango4j Thank you. Would it be an official option in main NeMo repo/pip package?
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.