VadDataset example
Hi, I saw the VadDataset class and it is mentioned in the Readme and elsewhere. Do you know of an example setup/recipe (perhaps in other repos?) that uses it to train a VAD/segmentation model? Thanks!
CC @desh2608, you might have some recipes using those.
I wonder if I should replace VadDataset with some other one as the flagship example. At the time, we had no ASR recipes and VAD seemed both pretty well defined and conceptually simple to showcase.
Hi, I'm bumping this old thread to ask if there is an obvious way in lhotse to apply a VAD mask to the features of a cut, in the sense of removing the unvoiced frames according to a per-frame 0/1 VAD mask. Basically, this would be like Kaldi's "select-voiced-frames", which applies a "vad.scp" to a "feats.scp" so the filtered features can then be used (possibly as input to an i-vector/x-vector nnet).
Thank you!
Would something like this work, assuming you have a VAD model in Python? Or are you looking for something different?
features = cut.load_features()
mask = compute_vad_mask(vad_model, features)
features = features[mask]
In terms of which VAD to apply, you can use e.g. SileroVAD: https://github.com/snakers4/silero-vad/wiki/Examples-and-Dependencies#examples
Actually, a workflow/integration into Lhotse would be nice if somebody is willing to contribute that.
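For concreteness, here is a rough sketch of what such a compute_vad_mask helper could look like with SileroVAD. It is hypothetical (not part of Lhotse), takes the cut rather than the feature matrix, assumes a mono cut at 8 or 16 kHz (what Silero supports) with pre-computed features, and uses the torch.hub entry point from the page linked above:

import numpy as np
import torch

# Load the Silero VAD model and its helper functions via torch.hub.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps = utils[0]


def compute_vad_mask(cut) -> np.ndarray:
    # Hypothetical helper: run SileroVAD on the cut's audio and convert the
    # returned speech timestamps (given in samples) into a boolean mask with
    # one entry per feature frame.
    audio = torch.from_numpy(cut.load_audio()[0])  # first (only) channel
    speech = get_speech_timestamps(audio, model, sampling_rate=int(cut.sampling_rate))
    mask = np.zeros(cut.num_frames, dtype=bool)
    for seg in speech:
        start = int(seg["start"] / cut.sampling_rate / cut.frame_shift)
        end = int(np.ceil(seg["end"] / cut.sampling_rate / cut.frame_shift))
        mask[start:end] = True
    return mask


features = cut.load_features()
voiced_features = features[compute_vad_mask(cut)]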
Thanks for your answer. I would like to be able to batch the features of Cuts and apply the VAD mask on the fly, as if it were a transform; the transformed (i.e. masked) features would be the input of a nnet, like a batch formed with a K2SpeechRecognitionDataset.
Now, a VAD applied to a Cut would modify the start and duration of the Cut, like perturb_speed does, but the features cannot simply be extracted from [start, start + duration], since there are holes (unvoiced frames) within that range. It doesn't seem to me that an input_transform would be feasible either, because the number of frames would change after batching. I'm not sure what I wrote above makes sense, though; maybe I don't have a proper understanding of the framework.
I was also thinking: since I would like to have a concatenation of voiced-frame features for a recording, could I just use a Cut with multiple supervisions, each representing the start and duration of a voiced segment? Would this do the job when batches are formed, or do I strictly need one supervision per cut? Keep in mind that I don't have to train anything and I don't need a target; I just want to extract an embedding from a nnet.
I think the simplest way to get that is to write your own dataset class like this:
import torch
from lhotse import CutSet
from lhotse.dataset.collation import collate_matrices


class EmbeddingWithVadDataset(torch.utils.data.Dataset):
    def __init__(self):  # add your own constructor args as needed
        self.vad = load_vad()  # placeholder: load your VAD model here

    def __getitem__(self, cuts: CutSet) -> dict:
        batch_feats = []
        for cut in cuts:
            feats = cut.load_features()
            voiced_mask = self.vad(feats)  # 0/1 (or boolean) decision per frame
            batch_feats.append(feats[voiced_mask])
        # Pad the variable-length voiced feature matrices into one batch tensor.
        batch_feats = collate_matrices(batch_feats)
        return {"features": batch_feats, "cuts": cuts}
It's also possible to add supervisions to indicate the voiced segments, but you'll still need to add some logic that does something like append_cuts(cut.trim_to_supervisions()); you could do this transform either before creating the sampler, or again inside the dataset class.
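If you go that supervision route, a rough sketch of the pre-sampler variant could look like the following; treat it as pseudocode, since it assumes the voiced segments are already attached as supervisions and that every cut has at least one of them:

from lhotse.cut import append_cuts

# Replace each cut by the concatenation of its voiced sub-cuts, e.g. when
# preparing the CutSet before creating the sampler.
cuts = cuts.map(lambda cut: append_cuts(list(cut.trim_to_supervisions())))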
It seems like you have some pre-computed VAD, and you want to apply it on the fly to the input features, possibly in the data-loader. I am assuming you have some kind of speaker ID system and you want to compute embeddings for full utterances (possibly containing silences) without the silence frames. Suppose you have a CutSet where each cut represents 1 utterance (or recording).
Here are 2 ways to do it:
Case 1: Frame-level VAD
If you have pre-computed features for the cuts, and frame-level VAD decisions on these features, you can store the VAD decisions as a TemporalArray as follows:
from lhotse import CutSet, LilcomChunkyWriter

with CutSet.open_writer(manifest_path) as cut_writer, LilcomChunkyWriter(
    storage_path
) as vad_writer:
    for cut in cuts:
        vad_decisions = vad.run(cut)  # vad_decisions is an np.ndarray of per-frame decisions
        cut.vad = vad_writer.store_array(cut.id, vad_decisions)
        cut_writer.write(cut)
Then, in your data-loader, these can be loaded by calling cut.load_vad() (thanks to the magic of the custom attribute), and applied to the feats loaded from cut.load_features() using the indexing that Piotr described.
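For example, a minimal sketch of that lookup, assuming the stored array holds one 0/1 decision per feature frame:

feats = cut.load_features()
vad_mask = cut.load_vad().astype(bool)  # the custom "vad" attribute stored above
voiced_feats = feats[vad_mask]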
Case 2: Segment-level VAD
It may happen that your VAD generates segments (in the form <start, end>) instead of frame-level decisions. You can create SupervisionSegments from each such segment (for each recording), and put them in cut.supervisions. Note that each cut can contain multiple supervisions.
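For instance, a small sketch of attaching such segments, where vad_segments is a hypothetical list of (start, end) times in seconds produced by your VAD:

from lhotse import SupervisionSegment

cut.supervisions = [
    SupervisionSegment(
        id=f"{cut.id}-vad-{i}",
        recording_id=cut.recording_id,
        start=start,
        duration=end - start,
    )
    for i, (start, end) in enumerate(vad_segments)
]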
Then, in your data-loader, you can do something like the following:
import numpy as np

cut_segments = cut.trim_to_supervisions(keep_overlapping=False)
feats = []
for c in cut_segments:
    feats.append(c.load_features())
feats = np.concatenate(feats, axis=0)
Note that this assumes that the segments returned by your VAD model are non-overlapping (it doesn't really make sense to have overlapping VAD segments anyway).
Great, thanks for the suggestions!