NeMo
Fixed chokepoint in diarization for longer audios
What does this PR do?
Even after the addition of the long-form clustering feature, the code was still failing because `get_argmin_mat` acted as a chokepoint, consuming far too much memory for a mathematical operation that is not especially complex: the tiling function ended up creating matrices of size n². The new function drastically reduces memory consumption with minimal change in running time. Fixes the problem described in OOM Clustering Bug Collection: asr
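The chokepoint and the for-loop fix can be sketched as follows. This is a toy reproduction, not the actual NeMo code: the function names `argmin_full` / `argmin_chunked` and the chunk size are hypothetical, chosen only to illustrate mapping each base-scale anchor to its nearest current-scale anchor without materializing the full n² distance matrix.

```python
import torch

def argmin_full(base_anchor: torch.Tensor, curr_anchor: torch.Tensor) -> torch.Tensor:
    """Tile-style version: builds the full |base| x |curr| distance matrix (O(n^2) memory)."""
    curr_mat = curr_anchor.tile(base_anchor.shape[0], 1)      # (B, C)
    base_mat = base_anchor.tile(curr_anchor.shape[0], 1).t()  # (B, C)
    return torch.argmin(torch.abs(curr_mat - base_mat), dim=1)

def argmin_chunked(base_anchor: torch.Tensor, curr_anchor: torch.Tensor, chunk: int = 256) -> torch.Tensor:
    """For-loop version: processes base anchors in chunks, so peak memory is O(chunk * |curr|)."""
    out = []
    for start in range(0, base_anchor.shape[0], chunk):
        block = base_anchor[start:start + chunk].unsqueeze(1)            # (chunk, 1)
        out.append(torch.argmin(torch.abs(curr_anchor.unsqueeze(0) - block), dim=1))
    return torch.cat(out)

base = torch.linspace(0.0, 100.0, 4000)  # toy base-scale segment anchors
curr = torch.linspace(0.0, 100.0, 700)   # toy coarser-scale segment anchors
assert torch.equal(argmin_full(base, curr), argmin_chunked(base, curr))
```

Both versions compute the same per-row argmin; only the peak memory footprint differs.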
Changelog
- Add specific line by line info of high level changes in this PR.
Usage
- You can potentially add a usage example below
# Add a code snippet demonstrating how to use this
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR. To re-run CI remove and add the label again. To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
- [ ] Make sure you read and followed Contributor guidelines
- [ ] Did you write any new necessary tests?
- [ ] Did you add or update any necessary documentation?
- [ ] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- [ ] Reviewer: Does the PR have correct import guards for all optional libraries?
PR Type:
- [ ] New Feature
- [ ] Bugfix
- [ ] Documentation
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed. The Contributor guidelines contain specific people who can review PRs to various areas.
Additional Information
- Related to # (issue)
I will modify the code soon so that all the test cases pass.
Thank you for suggesting this idea.
I currently don't have the bandwidth to test this, so let me check on it as soon as I can.
This needs to be tested with long-form and short-form audio clustering.
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.
This PR was closed because it has been inactive for 7 days since being marked as stale.
@tango4j @nithinraok Reminder to please review it
Hi, @AatikaNazneen. Can you provide the 4-hour-long audio data that is triggering the OOM, for testing? Also, there is a limit on audio length. Please be aware that NeMo speaker diarization cannot handle arbitrarily long audio: even with the long-form clustering method, the audio length is capped by GPU memory.
I haven't tested this file, but here is an example of a 4-hour podcast that can be used for testing: https://www.youtube.com/watch?v=HoRMD9NtKYc
I used this one: https://www.youtube.com/watch?v=0OhTQnDp1yw. After this change, I was able to diarize it on a SageMaker g4 instance, which has 16 GB of GPU RAM and 16 GB of CPU RAM.
@AatikaNazneen I was not able to find a way to download the wav file from your YouTube link. Could you please extract the wav file and upload it via a Google Drive or Dropbox account? Also, reviewing this PR will take some time, since we need to test other scenarios. Please provide the long-form audio files you can share; that will expedite the process of testing your PR.
@tango4j You can use that audio or any other. The main point is that long-form clustering previously took up more RAM, and that has been optimized, so this can be tested with any long audio; you would, however, have to limit your resources to see the before-and-after effect.
@tango4j Any updates?
@tango4j @nithinraok
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.
@tango4j @nithinraok Reminder to please review this.
@AatikaNazneen Sorry for the frustration. Let me review this today.
Hi, @AatikaNazneen.
I have run the original main version and your for-loop version.
First of all, your analysis that `torch.tile` can take excessive memory is valid. However, we can fix `get_argmin_mat` by using `torch.expand` instead:
```python
In [1]: import torch
In [2]: vec = torch.rand(1000, 1).T.to("cuda:0")  # 556 MB
In [3]: vec_tile = vec.tile(10000000, 1)          # 38704 MB; 38148 MB was added
In [4]: vec_expand = vec.expand(10000000, 1000)   # 0 MB was added
```
Now, if we change the `get_argmin_mat` lines to these:

```python
curr_mat = curr_scale_anchor.expand(base_scale_anchor.shape[0], -1)
base_mat = base_scale_anchor.expand(curr_scale_anchor.shape[0], -1).t()
```
These are the times taken for three 10-minute audio files:
- torch.expand: 0.0266 seconds
- torch.tile: 0.0568 seconds
- for loop: 0.1970 seconds
The bottom line is that we need to modify this to use expand instead of adding a for loop. The for loop dramatically increases the time complexity from O(S) to O(N·S), where S is the number of scales and N is the frame length at the base scale.
Let me push the `torch.expand` version of the code; please check whether the memory issue persists. As the torch example above shows, there should be no memory chokepoint with `torch.expand`.
@AatikaNazneen @tango4j Are more checks waiting to be reviewed?
@tango4j Your modification did not do the trick for me, and I still get an OOM error.
The reason is that `torch.expand` itself doesn't increase memory, but when we do the subtraction operation, it occupies double the memory for both vectors. The for loop reduces memory consumption by breaking the problem into smaller chunks, so the peak memory occupied is much lower than without the for loop.
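This distinction can be checked directly in PyTorch: `expand` returns a stride-0 view (no allocation), but an element-wise operation on it, such as the subtraction inside `get_argmin_mat`, materializes a full-size result tensor. A small CPU sketch with illustrative sizes:

```python
import torch

vec = torch.rand(1000, 1).T    # shape (1, 1000)
exp = vec.expand(50000, 1000)  # zero-copy view over the same 1000 floats
assert exp.stride() == (0, 1)  # stride 0 in dim 0: rows are not stored separately

col = torch.rand(50000, 1)
diff = exp - col               # the subtraction allocates the full 50000 x 1000 result
assert diff.is_contiguous()
assert diff.numel() * diff.element_size() == 50000 * 1000 * 4  # 200 MB of float32
```

So the expand trick defers the allocation rather than eliminating it; the peak shows up at the arithmetic step.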
@tango4j If you're convinced, then please revert it back to the for-loop version, and add the labels needed to run all the remaining checks as well.
The time complexity is theoretically the same, but with a for loop the memory of the previous iteration is freed (hence the increase in time), which significantly reduces the space complexity. Using the expand function, we need 2·n² memory at one time; using the for loop, we touch n·(2n) memory overall, but at any given time we only need 2n, so we don't hit OOM. Let me know if you need further clarification.
@AatikaNazneen I still do not see how the for loop and the expand function have the same complexity. Both in theory and in experiments, the for loop increases the running time: it is vectorized array assignment versus sequential processing, so the for loop obviously takes more time, and when implemented it also shows slower speed.
I missed the part where the expand approach causes memory consumption when it performs subtractions.
So, in short, I think we should create an option to use the for loop to handle extremely long audio. We cannot increase the time complexity by a factor of N just for rarely encountered, extremely long audio files.
@tango4j We can go with that implementation when the long-form clustering pipeline is used.
> I missed the part where the expand approach causes memory consumption when it performs subtractions.

If you take a look at line 13 in the image I shared, the memory increment is twice the increment at line 9. Also, I tested the 5-hour audio on the SageMaker ml.g4dn instance, and it was failing with the expand solution.
@AatikaNazneen I can see that line 13 in your screenshot shows the additional memory usage during the subtraction of the copied tensors. I neglected this part when I was testing; this is why we are not saving much on peak memory usage compared to the "tile" method.
So at the end of the day, I believe this is all a memory-time trade-off. As you suggested, we can let the system use the for loop for long-form audio clustering. That way, we favor memory efficiency only for long-form clustering and reduce the peak memory usage there.
I will add this option shortly to long form audio clustering path.
- standard: expand (speed oriented)
- long form audio: for loop (memory oriented)
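The two-path dispatch above could look like the following sketch. The threshold value, function name, and structure are hypothetical (the actual NeMo change, per a later comment, switches at roughly 10K embedding frames); broadcasting is used here in place of explicit `expand` calls, which is equivalent since PyTorch broadcasting expands views internally.

```python
import torch

LONG_FORM_THRESHOLD = 10_000  # frame count above which the memory-oriented path is used

def pairwise_argmin(base_anchor: torch.Tensor, curr_anchor: torch.Tensor, chunk: int = 1024) -> torch.Tensor:
    """Map each base-scale anchor to the index of its nearest current-scale anchor."""
    if base_anchor.shape[0] < LONG_FORM_THRESHOLD:
        # Standard path: one broadcasted op (speed oriented, O(n^2) peak memory).
        diff = torch.abs(curr_anchor.unsqueeze(0) - base_anchor.unsqueeze(1))
        return torch.argmin(diff, dim=1)
    # Long-form path: chunked for loop (memory oriented, O(chunk * |curr|) peak memory).
    out = []
    for s in range(0, base_anchor.shape[0], chunk):
        block = base_anchor[s:s + chunk].unsqueeze(1)
        out.append(torch.argmin(torch.abs(curr_anchor.unsqueeze(0) - block), dim=1))
    return torch.cat(out)

base_long = torch.linspace(0.0, 3600.0, 20_000)  # toy long-form anchors (takes the loop path)
curr = torch.linspace(0.0, 3600.0, 500)
expected = torch.argmin(torch.abs(curr.unsqueeze(0) - base_long.unsqueeze(1)), dim=1)
assert torch.equal(pairwise_argmin(base_long, curr), expected)
```

The assertion confirms the memory-oriented path returns exactly what the one-shot broadcasted computation would.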
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.
When do you think all of this will be completed? Thanks. @tango4j @nithinraok
@AatikaNazneen @tttalshaul Sorry, I have been traveling overseas for a conference. I will work on this PR starting today and finalize it this week.
reminding 🙏🏿
@AatikaNazneen @tttalshaul I pushed the change to @AatikaNazneen's repo. This version uses expand for fewer than 10K embedding frames, and a for loop for more than 10K embeddings.
@tango4j Thank you. Would it be an official option in main NeMo repo/pip package?
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.