
Improved detect silence

Open lumip opened this issue 2 years ago • 4 comments

Overview

Reimplementation of detect_silence: Previously this function would invoke RMS computations independently for each slice of length min_silence_len in the given audio segment, which leads to a lot of recomputation of nearly identical values if the seek_step is small. The new implementation avoids this, resulting in much shorter detection times.
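The general idea of avoiding the repeated work can be sketched with a running sum of squared samples. This is only an illustration using the standard library, not the code from this PR: the function names are made up for the example, and it operates on a plain list of integer samples rather than on a pydub AudioSegment with millisecond slices.

```python
import math
from itertools import accumulate


def window_rms_naive(samples, window, step=1):
    """Recompute the RMS from scratch for every window (overlapping work)."""
    out = []
    for start in range(0, len(samples) - window + 1, step):
        chunk = samples[start:start + window]
        out.append(math.sqrt(sum(s * s for s in chunk) / window))
    return out


def window_rms_prefix(samples, window, step=1):
    """Produce the same values, but square and sum each sample only once."""
    # prefix[i] holds the sum of squares of samples[:i]
    prefix = [0] + list(accumulate(s * s for s in samples))
    return [
        math.sqrt((prefix[start + window] - prefix[start]) / window)
        for start in range(0, len(samples) - window + 1, step)
    ]


samples = [0, 3, -4, 12, 5, -1, 0, 2]  # hypothetical sample values
naive = window_rms_naive(samples, window=4)
fast = window_rms_prefix(samples, window=4)
```

With a small step the naive version touches every sample roughly `window / step` times, while the prefix-sum version needs one pass plus one subtraction per window.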

Caveats

This introduces numpy as a new dependency. This is for two reasons:

  1. it makes the computation easy to express
  2. it is very performant due to numpy being highly optimized for computations on large numeric arrays
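To illustrate both points, here is a minimal numpy sketch of a windowed RMS followed by a threshold check. It is an assumption about what such an implementation could look like, not the code in this PR; `window_rms` and `silent_window_starts` are hypothetical names, and the inputs are raw sample values rather than an AudioSegment.

```python
import numpy as np


def window_rms(samples, window, step=1):
    """RMS of every window of `window` samples, starting every `step` samples."""
    squares = np.asarray(samples, dtype=np.float64) ** 2
    # prefix[i] is the sum of the first i squared samples
    prefix = np.concatenate(([0.0], np.cumsum(squares)))
    starts = np.arange(0, len(squares) - window + 1, step)
    return np.sqrt((prefix[starts + window] - prefix[starts]) / window)


def silent_window_starts(samples, window, threshold, step=1):
    """Start indices of all windows whose RMS does not exceed `threshold`."""
    rms = window_rms(samples, window, step)
    return np.flatnonzero(rms <= threshold) * step


samples = [0, 0, 0, 0, 100, 100, 0, 0, 0, 0]  # hypothetical sample values
starts = silent_window_starts(samples, window=2, threshold=1)
```

The whole computation is a handful of array operations, so the per-window loop runs inside numpy instead of in Python.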

While implementing this without numpy would be possible, it would likely not see the same performance increase or ease of implementation.

detect_silence previously used audioop to compute RMS values of slices, which rounds the computed value down to the nearest integer, while the silence threshold is not rounded. The new implementation no longer rounds, so some slices that were previously detected as silent are no longer detected as such. In practice this means that detected silent regions might be slightly shorter than before (usually by one or two seek_steps).
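A small worked example of this edge case, as a sketch under stated assumptions: 16-bit audio with a full-scale value of 32768, a threshold derived from -50 dB with the amplitude convention, and a slice counted as silent when its RMS is less than or equal to the threshold. The flooring is emulated with `math.floor` rather than by calling audioop, and the sample values are made up.

```python
import math

MAX_AMPLITUDE_16BIT = 32768  # assumed full-scale value for 16-bit audio


def db_to_amplitude_ratio(db):
    """Convert decibels to a linear amplitude ratio."""
    return 10 ** (db / 20)


def rms_exact(samples):
    """Root mean square without any rounding."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def rms_floored(samples):
    """Root mean square rounded down, emulating an integer-valued RMS."""
    return math.floor(rms_exact(samples))


threshold = db_to_amplitude_ratio(-50) * MAX_AMPLITUDE_16BIT  # about 103.62
samples = [103, 104, 104, 104]  # exact RMS is about 103.75

silent_with_flooring = rms_floored(samples) <= threshold      # 103 <= 103.62
silent_without_flooring = rms_exact(samples) <= threshold     # 103.75 > 103.62
```

The same slice is classified as silent when the RMS is rounded down and as not silent when it is compared unrounded, which is the behaviour change described above.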

Performance results

%timeit results on audio segments consisting mostly of silence

20 minute segment

# old
> %timeit detect_silence(aus_short, silence_thresh=-50, seek_step=1)
1min 36s ± 914 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# new
> %timeit detect_silence(aus_short, silence_thresh=-50, seek_step=1)
2.66 s ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

~114 minute segment

# old
> %timeit detect_silence(aus, silence_thresh=-50, seek_step=1)
8min 37s ± 10.5 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

# new
> %timeit detect_silence(aus, silence_thresh=-50, seek_step=1)
15 s ± 392 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
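For reference, the speedup implied by the mean timings above can be worked out as follows. The snippet only restates the reported numbers; the helper name is made up for the example.

```python
def seconds(minutes=0, secs=0.0):
    """Convert a minutes/seconds timing into seconds."""
    return 60 * minutes + secs


# (old, new) mean runtimes taken from the %timeit output above
timings = {
    "20 minute segment": (seconds(1, 36), seconds(secs=2.66)),
    "~114 minute segment": (seconds(8, 37), seconds(secs=15)),
}

speedups = {name: old / new for name, (old, new) in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: about {factor:.0f}x faster")
```

Both cases come out at roughly a 35x reduction in runtime.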

lumip, Jul 23 '23 15:07

It does not seem to work properly. When run on the audio of a YouTube video (~4h long) with: split_on_silence(audio_segment, min_silence_len=800, keep_silence=True)

It returns the following ranges (just 4 segments): (0, 6812736, 6812736, 13615464, 13615464, 13635080, 13635080, 13677621)

When running the same file with the same arguments (min_silence_len=800, silence_thresh=-16) in Audacity, it finds lots and lots of silence (and I can confirm at a glance that those findings are correct): [image]

emsi, Feb 05 '24 15:02

Hey, sorry, I only saw your response just now. Could you perhaps provide a link to the video in question so that I can have a look?

lumip, Feb 24 '24 16:02

I believe I was processing the audio from this video:

https://youtu.be/AY9MnQ4x3zk

BTW: I ended up using ffmpeg. Super fast and accurate.

emsi, Feb 24 '24 17:02

To come back to this: first, I want to point out that the regions of silence found with the changes in this PR match those found by the current pydub implementation fairly well overall. There were some larger deviations that I might look into a bit more, but I think these are all explained by the caveats I already pointed out.

With regard to the discrepancy with Audacity and ffmpeg: if I run detect_silence with silence_thresh=-32, I obtain results that reasonably match those produced by ffmpeg with a threshold of -16. pydub's db_to_float applies a different conversion depending on whether the using_amplitude keyword argument is True: in one case the decibel value is divided by a number twice as large as in the other. I therefore believe that pydub's silence detection interprets the dB value differently from Audacity and ffmpeg. I tried to figure out which interpretation is more canonical, but I couldn't find definitions of dBFS that are reliable and do not contradict each other.
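The factor of 2 can be made concrete with a standalone sketch of the two conventions. It mirrors my understanding of db_to_float (division by 20 for amplitude, by 10 for power) but is written out locally instead of importing pydub, so treat the function below as an assumption rather than the library's code.

```python
import math


def db_to_ratio(db, using_amplitude=True):
    """Convert decibels to a linear ratio under one of two conventions."""
    db = float(db)
    if using_amplitude:
        return 10 ** (db / 20)  # amplitude (field quantity) convention
    return 10 ** (db / 10)      # power convention


amplitude_at_minus_32 = db_to_ratio(-32, using_amplitude=True)
power_at_minus_16 = db_to_ratio(-16, using_amplitude=False)

# Both evaluate to 10 ** -1.6, i.e. about 0.025 of full scale
same_ratio = math.isclose(amplitude_at_minus_32, power_at_minus_16)
```

Under these two conventions, -32 dB as an amplitude ratio and -16 dB as a power ratio describe the same linear value, which would be consistent with the -32 versus -16 observation above.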

lumip, Feb 29 '24 20:02