pyannote-audio memory optimizations for pyannote.audio.core.inference.Inference.aggregate()

While diarizing long audio recordings (>6 hours), I noticed very high memory usage, upwards of 30GB. I tracked the spike to pyannote.audio.core.inference.Inference.aggregate(), which was initializing several very large tensors.

With this PR, RAM usage drops by 10-15 GB for long audio files in my tests. I have not tested extensively, but I do not believe this impacts accuracy or speed.
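To illustrate the general idea (this is not the code from the PR, just a generic NumPy sketch with made-up function names), here is an overlap-add average written two ways. The first materializes one full-length row per chunk; the second accumulates into two preallocated buffers, so peak memory scales with the output length rather than with `num_chunks * num_frames`.

```python
import numpy as np


def aggregate_naive(chunks: np.ndarray, step: int, num_frames: int) -> np.ndarray:
    """Overlap-add average that builds (num_chunks, num_frames) temporaries."""
    num_chunks, chunk_len = chunks.shape
    padded = np.zeros((num_chunks, num_frames), dtype=np.float32)
    mask = np.zeros((num_chunks, num_frames), dtype=np.float32)
    for i, chunk in enumerate(chunks):
        start = i * step
        padded[i, start:start + chunk_len] = chunk
        mask[i, start:start + chunk_len] = 1.0
    # Average over the chunks that cover each frame.
    return padded.sum(axis=0) / np.maximum(mask.sum(axis=0), 1e-12)


def aggregate_in_place(chunks: np.ndarray, step: int, num_frames: int) -> np.ndarray:
    """Same result, using only two (num_frames,) buffers updated in place."""
    _, chunk_len = chunks.shape
    total = np.zeros(num_frames, dtype=np.float32)
    count = np.zeros(num_frames, dtype=np.float32)
    for i, chunk in enumerate(chunks):
        start = i * step
        total[start:start + chunk_len] += chunk
        count[start:start + chunk_len] += 1.0
    return total / np.maximum(count, 1e-12)


# Hypothetical sizes, chosen only to keep the demo small.
rng = np.random.default_rng(0)
num_chunks, chunk_len, step = 20, 50, 10
num_frames = (num_chunks - 1) * step + chunk_len
chunks = rng.random((num_chunks, chunk_len), dtype=np.float32)

naive = aggregate_naive(chunks, step, num_frames)
in_place = aggregate_in_place(chunks, step, num_frames)
```

Both functions return the same averaged scores; only the size of the intermediate arrays differs.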

I do have one question about one of the commits.

Currently, frames is recreated only so that it has the same start as chunks, but as I understand it, there are no cases where chunks.start and frames.start would be anything other than 0.0.

Is this assumption correct? If not, frames should still be reinitialized.

With these changes, the whole speaker diarization pipeline peaks below 20GB of RAM for a 9-hour recording. The remaining peak is set by both Inference.aggregate and scipy.cluster.hierarchy.linkage in the AgglomerativeClustering pipeline.
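For a rough sense of why the clustering step is a bottleneck: when `scipy.cluster.hierarchy.linkage` is given raw observations, it works from a condensed pairwise distance matrix with n(n-1)/2 float64 entries. The sketch below only does that arithmetic. The helper name and the embedding counts are hypothetical, not measurements from the 9-hour recording.

```python
def condensed_distance_bytes(num_embeddings: int, itemsize: int = 8) -> int:
    """Bytes needed for a condensed pairwise distance matrix.

    Assumes one float64 (8 bytes) per unordered pair of embeddings.
    """
    num_pairs = num_embeddings * (num_embeddings - 1) // 2
    return num_pairs * itemsize


# Hypothetical embedding counts, for illustration only.
for n in (10_000, 50_000, 100_000):
    gib = condensed_distance_bytes(n) / 2**30
    print(f"{n:>7} embeddings -> {gib:6.1f} GiB")
```

The cost grows quadratically with the number of embeddings, so doubling the recording length roughly quadruples this part of the memory footprint.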

May 17 '24 22:05 benniekiss

Rebased the changes onto the most recent develop, and fixed an incorrect git authorship config on my end.

May 18 '24 12:05 benniekiss

Rebased again and added back the frames section.

May 23 '24 13:05 benniekiss

Merged! 🎉 Thanks a lot for your contribution. Will be part of next release.

May 28 '24 12:05 hbredin

Awesome! I really appreciate your work. pyannote has become an invaluable tool, so I'm glad I can give back in my small way.

May 28 '24 13:05 benniekiss

I'd love to know more about how pyannote impacts your work. Feel free to drop me an email!

May 28 '24 15:05 hbredin