lhotse
Silero VAD for cleaning the dataset from silence
I intend to add a new workflow to Lhotse for processing arbitrary audio datasets by removing silence and preserving only speech, using the Silero VAD, which can accurately detect speech in an audio stream. The workflow should help users quickly and efficiently convert arbitrary datasets by cutting out silence and retaining only speech. An important aspect of this process is preserving all supervisions for each segment while adjusting them to the changes made to the audio file. Before accepting this PR, I invite you to review my code. Currently, the code handles the task only in the trivial case, processing MonoCut objects and not supporting other Cut types. I want to add support for the other Cut types, but I'm not sure about the best approach at the moment. I would appreciate your comments and suggestions for improving the code. I would also be glad if you could try running the code and share your impressions. I'm confident that your feedback will help make it even better.
Key Changes
- Added the speach_only function, which allows processing audio files by removing silence and preserving only speech (see the usage sketch after this list).
- Added the speach_only workflow, which enables processing datasets from the CLI.
- The code is written with the intention of being usable in various scenarios similar in concept to the addressed task.
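To give a sense of the intended usage, here is a rough sketch of how I imagine calling the function from Python. The import path, signature, and argument names are only my current proposal in this PR and may well change during review:

from lhotse import CutSet
from lhotse.workflows import speach_only  # proposed in this PR, not yet part of Lhotse

cuts = CutSet.from_file("data/cuts.jsonl.gz")

# Detect speech with Silero VAD, drop the silent regions, re-save the trimmed
# audio, and keep the supervisions consistent with the new timeline.
trimmed = speach_only(cuts, output_dir="data/speech_only")
trimmed.to_file("data/speech_only/cuts.jsonl.gz")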
Issues Requiring Discussion
There are several places in the code where I'm uncertain about the choice of implementation. In these places, I raise NotImplementedError to indicate that I need assistance in selecting the best implementation approach. This is mainly related to handling subclasses of the Cut class other than MonoCut. I'm not sure about the best way to handle these cases.
Additionally, I have the _to_mono function, which converts Recording objects to mono for speech analysis with Silero VAD. I'm confident there's a more elegant way to do this, so please provide some guidance.
I would like to receive feedback on function naming, variable naming, and code architecture. If you have specific suggestions for improvement, I would be glad to hear them.
In particular, I don't really like the name speach_only for the workflow. Any suggestions for a more appropriate name?
Thanks! I'll review it tomorrow, but before I do -- based on your description, it looks like a similar outcome may be achieved by running the activity_detection workflow and then calling cuts = cuts.trim_to_supervisions() on the result (we also have trim_to_unsupervised_segments as a complementary operation), which already supports multi-channel data properly. Can you explain what you expect to be different here?
As a note regarding mono vs. multi-channel: I think it makes sense to load and process each channel separately with VAD and assign the resulting supervision to the right channel ID. Think of cases such as phone/online calls, where speech activity is clearly channel-dependent.
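For example, something along these lines (detect_speech here is just a placeholder for whatever VAD call ends up being used; it is not an existing Lhotse function):

from lhotse import Recording, SupervisionSegment

recording = Recording.from_file("stereo_call.wav")

supervisions = []
for channel in recording.channel_ids:
    # Load a single channel for VAD; the returned array has shape (1, num_samples).
    samples = recording.load_audio(channels=channel)
    # detect_speech is a placeholder returning (start, duration) pairs in seconds.
    for start, duration in detect_speech(samples, recording.sampling_rate):
        supervisions.append(
            SupervisionSegment(
                id=f"{recording.id}-ch{channel}-{round(start * 100):06d}",
                recording_id=recording.id,
                start=start,
                duration=duration,
                channel=channel,  # keep the channel the activity was detected on
            )
        )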
The main purpose of this workflow is to quickly re-save the dataset by cutting everything that is not speech out of the audio files, and to adjust the supervisions so that all timestamps and durations change in concert with the changes to the audio. This is really similar to trim_to_supervisions on VAD annotations, and my first idea was to use that, but I couldn't find a nice way to solve it with the standard tools.
I think this workflow can be recreated with the existing operations as follows:
# pseudo-code workflow
recordings = RecordingSet(...) # N recordings
supervisions = activity_detection(recordings) # M supervisions
cuts = CutSet.from_manifests(recordings, supervisions) # N cuts with 0+ supervisions each
cuts = cuts.trim_to_supervisions() # M cuts with exactly 1 supervision
cuts = cuts.save_audios(out_dir) # [optional] same as above, but the audio fragments are hard-copied to disk
cuts.to_file(...) # save the manifest
Note that once you create a cutset, the supervision time boundaries are always relative to the start of the cut. After trim_to_supervisions, supervisions generally have start=0 and duration=cut.duration.
It would be helpful if you could indicate what requirements you have that are missing from the above.
I understand your point about using existing operations, but your proposed method and my workflow have different goals and functions.
My proposed workflow aims to synchronously remove all silence segments from audio files and precisely refine the supervisions for each segment. This process not only reduces the size of the audio files, but also completely rebuilds the dataset, making it more accurate. This is useful when the user wants to apply a task-specific activity detector and its partitioning is coarse enough to require refinement.
In the proposed pseudo-code, we simply detect activity in the original audio files and then trim the segments to the appropriate activity time bounds. This is only convenient for the task of partitioning a dataset into segments, e.g., for model training. However, this method does not allow the supervisions and the audio to be put back together correctly.
The workflow I have proposed could be a starting point for creating more complex pipelines that use different activity detectors to refine the supervision and repackage/unpackage the dataset. For example, one user might want to apply a Spanish speech detector, while another might want to apply a dog bark detector.
I propose to consider my workflow as an auxiliary recipe for recompiling a dataset with supervision refinement. I think it would be quite difficult to implement such a pipeline by standard means. In any case, it would require
Let's imagine that I have one particular Cut. It is labeled; on the audio recording, some people are having a calm conversation with occasional silence. I want to turn this Cut into a new Cut of the same kind, but with all the silence sections removed. It is important for me not to lose the labels of the overlapping supervision sections. It is also important that supervisions that contained a long pause be refined to match the audio. In this scenario it is not enough to simply split one Cut into many; I also need an algorithmic basis for putting it all back together. Most likely, when solving such a problem, I don't need to be able to transform back to the original dataset; I will be more than satisfied to dump the distilled dataset to disk and then work with it. For this purpose I propose this workflow.
This scenario can also be addressed easily using existing Cut methods such as append, trim_to_supervision_groups, etc. I don't see the utility in having workflows for very specific use cases which are otherwise feasible using cut manipulations.
In this contribution, I have made significant improvements to the activity detection process in lhotse. The main goal of these improvements is to provide users with a more versatile and efficient way to process audio data, specifically by removing silence and retaining only speech segments using the Silero VAD model. This workflow simplifies the audio data conversion process by cutting out silence, resulting in a more optimised and accurate dataset.
- Added the trim_inactivity function: introduced the trim_inactivity function (replacing speach_only), which processes audio files by removing silence and preserving speech segments. This function is the basis of the entire workflow.
- Implemented the trim_inactivity workflow: the trim_inactivity workflow is a CLI tool that allows users to process datasets efficiently. It uses the trim_inactivity function to trim silence and keep only the relevant speech segments.
- The code is designed to be used in a variety of scenarios, allowing it to be adapted to different use cases beyond the original task.
- Documentation with usage examples has been written.
The main objective of this workflow is to efficiently refine and re-save datasets by removing non-speech segments from audio files. This ensures that all the supervision information is correctly matched with the audio changes, thus producing a more accurate dataset. While some similar operations can be performed using existing Lhotse tools such as trim_to_supervisions, this workflow offers a more optimised and convenient approach to refining and repackaging datasets with a focus on different activity detectors.
Could you give some examples of use-cases that would require the user to create such recordings with detected activities, and which cannot be achieved by using cuts?
The code I am proposing relies on the Cuts functionality wherever possible to solve the task at hand. I don't know how it is possible to transform an arbitrary dataset in this way using standard tools. As far as I understand, it is currently impossible to transform a CutSet using a combination of a few elementary actions so that the number of Cuts remains the same, each Cut keeps the same number of SupervisionSegments, and each segment is refined according to the audio regions discarded within it. I would appreciate it if you could provide an example of code that removes the desired audio segments from a Recording while refining the supervisions in the parent Cut.
What I meant was: what's a use-case where a user would need this workflow rather than just using the cuts obtained from the supervision segments?
An example use case could be the need to refine a set of audio recordings where medical professionals discuss the results of patient medical examinations. These recordings contain medical text reflected in noisy supervision data, which was collected using a semi-automatic method. The peculiarity of this supervision data is that the annotation of one piece of text overlaps with another, and it requires refinement using automatic audio transcription algorithms. Prior to this refinement stage, it is necessary to prepare the audio recordings by removing all background noises and periods of silence, without losing context. The segments that remain after the removal of inactivity should include overlaps in the supervision data for subsequent refinement.
We need to be able to re-save the dataset by cutting out the silence sections, so that when working with such datasets in the future we can be sure that it is clean enough and does not contain background noise. The presence of background noise in the dataset slows down experimentation, inefficiently consumes space on discs, and introduces bias in hypothesis testing.
I think I'm starting to understand what you are trying to achieve. Can you confirm the problem boils down to the following description: given a cut with N supervisions, modify the supervision start and end times according to new external information.
Note that I need to have an understanding what's the high level goal of this before I start reviewing.
If the above statement is true, can this problem be solved using the following actions:
- Run the VAD on a cut and obtain a list of VAD-supervisions.
- Intersect the VAD-supervisions with the original supervisions. Intersection here means creating a new supervision list where the segments cover only the time intervals found in both of the inputs. The result copies all metadata from the original supervision list.
- Update the supervisions in the cut.
If the above interpretation is correct, the only thing we're missing in Lhotse is the implementation of the intersection of two supervision sets. This could be added as a new method on Cut/CutSet, e.g. def refine_supervision_times(self: Cut/CutSet, other: List[Supervision]) -> Cut/CutSet. I don't think it requires a separate workflow though.
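To make the intersection semantics concrete, a minimal sketch (refine_supervision_times itself does not exist in Lhotse yet, and the splitting behavior shown here is exactly the point under discussion):

from typing import List

from lhotse import SupervisionSegment
from lhotse.utils import fastcopy


def intersect_supervisions(
    original: List[SupervisionSegment],
    vad: List[SupervisionSegment],
) -> List[SupervisionSegment]:
    """Keep only the parts of ``original`` that overlap some VAD segment,
    copying all metadata from the original segments."""
    result = []
    for sup in original:
        for activity in vad:
            start = max(sup.start, activity.start)
            end = min(sup.end, activity.end)
            if end > start:
                # Note: a supervision overlapping several VAD segments is
                # split into several pieces here (keeping the same id).
                result.append(fastcopy(sup, start=start, duration=end - start))
    return result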
No, unfortunately the problem does not boil down to the description you suggested, because it does not take into account the need to refine the silence intervals inside the supervisions.
The main task is to clean up the audio recording. We want to get a new Cut whose Recording contains no silence. At the same time, it is important for us to correctly preserve the entire supervision inside the Cut. Importantly, we do not want to split the original Cut into a CutSet where each element contains one SupervisionSegment. We want a new Cut that contains all of the original SupervisionSegments (except those that are dropped entirely by the deletion procedure).
I think that in addition to the intersection procedure you suggest, AlignmentItems could be used to mark the inner speech/silence segments, or some kind of Recording masking procedure could be applied. We would also need a procedure that loads the audio taking the AlignmentItems or the audio mask into account.
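As a rough sketch of what I mean with AlignmentItem (the "vad" alignment kind and the symbols here are just labels I am assuming, not an established convention in Lhotse):

from lhotse import SupervisionSegment
from lhotse.supervision import AlignmentItem

sup = SupervisionSegment(
    id="utt-1", recording_id="rec-1", start=2.0, duration=5.0, text="..."
)

# Mark which parts of the supervision are speech and which are silence,
# on the same time axis as the supervision itself, without splitting it.
sup = sup.with_alignment(
    kind="vad",
    alignment=[
        AlignmentItem(symbol="speech", start=2.0, duration=1.0),
        AlignmentItem(symbol="silence", start=3.0, duration=2.0),
        AlignmentItem(symbol="speech", start=5.0, duration=2.0),
    ],
)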
Again, why can this not be done by appending the cuts corresponding to the supervisions? Why does the "filtered" recording need to be saved beforehand, except perhaps for loading efficiency.
Again, why can this not be done by appending the cuts corresponding to the supervisions?
Could you give a concrete example of how exactly we can rewrite one single source SupervisionSegment, given the silence intervals, without splitting it into duplicates with different offsets and durations?
Why does the "filtered" recording need to be saved beforehand, except perhaps for loading efficiency.
To simplify further work with a cleaner dataset, and save disk space.
┌─────────────────────────────────────────────────────────────────────┐
│ Original recording (with speech and noise) │
└─────────────────────────────────────────────────────────────────────┘
│
▼
.─────────────────────────────.
( Speech activity detection )
`─────────────────────────────'
│
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Cut with 2 supervision segments │
│ ◁─────────────────▷ ◁─────────────────▷ │
└─────────────────────────────────────────────────────────────────────┘
│
│
▼
.───────────────────────────────────────.
( cut.trim_to_supervision_segments() )
`───────────────────────────────────────'
│
┌────────────┴───────────────────┐
│ │
▼ ▼
┌────────────────┐ ┌──────────────────┐
│ Speech cut 1 │ │ Speech cut 2 │
└────────────────┘ └──────────────────┘
│ │
└──────────────┬─────────────────┘
│
▼
.─────────────────────────.
( append() )
`─────────────────────────'
│
▼
┌───────────────────────────────────────┐
│ Combined speech cuts │
└───────────────────────────────────────┘
│
▼
.─────────────────────────.
( cut.save_audio() )
`─────────────────────────'
│
│
│
▼
1 2 34 5 6 7 8 9
┌─────────────────────────────────────────────────────────────────────┐
│ Cut with Supervision │
│ ◁───────────────.───────.▷ │
| ◁───────.───────.───────.─────▷ │─┐
│ . . . ◁─────.─────────────────▷ | │
└─────────────────────.───────.───────.───────.───────────────────────┘ │
. . │ . . │
. . ▼ . . │
.─────────────────────────────. │
( Speech activity detection ) │
`─────────────────────────────' │
. . │ . . │
. . │ . . │
. . ▼ . . │
┌─────────────────────.───────.───────.───────.───────────────────────┐ │
│ . Silence Supervision. │ │
│ ◁───────▷ ◁───────▷ │ │
└─────────────────────.───────.───────.───────.───────────────────────┘ │
. . │ . ┌──────────────────────────────┘
. . ▼ . ▼ .
.──────────────────────────.
( Some kind of procedure )
`──────────────────────────'
. . │ . .
. . ▼ . .
┌─────────────────────.───────.───────.───────.───────────────────────┐
│ . Combined speech cuts . │
│ ◁───────────────/////////▷ ///////// │
| /////////◁──────/////////─────▷ │
│ ///////// /////////◁────────────────▷ |
└─────────────────────────────────────────────────────────────────────┘
How can we describe this with the procedure you suggest?
How can such a resulting Cut be described? Is there any way to guarantee that when loading an audio track with load_audio, the numpy array will be shorter than the original and will not contain silence segments, and only three SupervisionSegments will remain in the cut.supervisions list?
What do the //// represent? Does this mean you are effectively removing the time segments corresponding to "silence" from your original supervision segments? If so, perhaps this can be achieved by having interval tree operations for the SupervisionSet class, as Piotr suggested. Once you have some defined segments, you can use cut.trim_to_supervision_groups() instead of cut.trim_to_supervisions() if you believe that there may be overlapping segments and you want to keep them together.
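As a rough illustration of what such interval-tree operations could do (intervaltree is already a Lhotse dependency; this is not an existing SupervisionSet method):

from intervaltree import IntervalTree

# One supervision spanning 2.0-7.0 s, with detected silence at 3.0-5.0 s.
tree = IntervalTree()
tree[2.0:7.0] = {"id": "utt-1"}

# chop() removes the silent span, splitting overlapping intervals as needed.
tree.chop(3.0, 5.0)

for iv in sorted(tree):
    print(iv.begin, iv.end, iv.data)
# 2.0 3.0 {'id': 'utt-1'}
# 5.0 7.0 {'id': 'utt-1'}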
Yes, //// means that we trimmed the silence and refined the supervision intervals. In this PR, I implemented the required operations using IntervalTree to achieve the desired result. Since functionality like refine_supervision_times proposed by Piotr is not yet part of the basic Cut methods, I may suggest modifying my proposed trim_inactivity workflow in the future, once the corresponding functionality is implemented.
Since you already have the algorithm implemented, would you be willing to put this functionality as a SupervisionSet method, and then this workflow can simply use it? This way, it would also allow other users to directly use that method.
Yes, of course, I am ready to implement such functionality in Cut, SupervisionSet, etc. But we need to agree precisely on how to test this functionality and where in the code to implement it. Personally, I think this functionality is quite exotic, and few people really need it directly when working with a CutSet. But if you think it should be included in the backbone of the library, let's do it.
Let's see what @pzelasko has to say about this.
I'm still not sure. It looks like your example may be implemented with .truncate()/.split() to remove the detected non-speech segments and .append() to combine whatever cuts remained. The issue that remains is how to interpret an existing supervision segment being "masked out". Once you truncate, it will have to become two sub-segments, but unless you know the alignment, such supervision is not meaningful anymore for tasks such as ASR. Although you may want to replace these sub-segments with a new, merged supervision in the resulting MixedCut. This could probably work and be implemented as a part of the "refine" thingy. What do you think?
I think the main purpose of the silence detector is to remove silence from the supervised segment of audio. All of the proposed alternatives to re-saving the full track and its supervisions require splitting the supervised segment into parts. I believe that duplicating a supervision segment is disruptive in any task; a supervision cannot really be divided at all if it is represented only by an offset and a duration. I think the best way to natively implement the required functionality in Lhotse is an AudioSource masking mechanism. The mask could be described with intervals, similarly to supervisions or alignments, and be a serializable part of the Recording object.
I would go further with this idea and say that a Recording could be described by a sequence of audio segments, each given by an offset and a duration, such that when audio is loaded with load_audio, the segments are loaded from the AudioSource sequentially and concatenated. Such a description would make it possible not only to cut segments out of the audio, but also to build repeated, thinned, and truncated Recordings. This mechanism is in fact partially implemented in Recording already, except that there is only one such segment.
I appreciate the discussion but the design you're suggesting is too complex and not necessary. You can already achieve sequential loading of various audio chunks using cuts. If you need to mask out some portions of the audio, you can do it post-hoc by keeping the mask interval information either as overlapping supervisions (somehow marked as special: ids, custom fields) or in the cut custom fields. However, I don't really see why you would want to mask out silence. If you want to get rid of these segments of the recording instead, you can follow the procedure I suggested above.
To clarify, here's an example (which should be generalized to arbitrary lists of supervisions if you want to go this way):
from lhotse import Recording, SupervisionSegment
from lhotse.utils import fastcopy

r = Recording(...)
sups = [
SupervisionSegment(..., start=2, duration=5),
]
# Assume:
# silence_segments = [
# SupervisionSegment(..., start=3, duration=2)
# ]
silence_segments = run_vad(r)
# Note: if we used silence segments to cut supervisions, the original supervision would have been split into
# two sub-segments of: start=2, duration=1 and start=5, duration=2
# Instead of splitting, we create a cut that skips the silent segment in the recording and has a new supervision
# that omits the silence:
c = r.to_cut()
new = (
c
.truncate(start=2, duration=1)
.append(
c.truncate(start=5, duration=2)
)
)
# We will now add the updated supervision information. Note:
# - we update start=0 because we removed initial silence
# - we update duration=3 because we removed the internal 2s of audio silence that the original supervision over-spanned
new.supervisions = [fastcopy(sups[0], start=0, duration=3)]