Metadata mode for torchaudio.datasets

leo19941227 opened this issue 3 years ago • 6 comments

🚀 The feature

Hello,

Thanks for the handy tools for parsing the datasets! I am wondering if it is possible to let the torchaudio.datasets classes have two modes:

  1. Return the actual waveform
  2. Return the waveform path

Other metadata, like transcriptions and speaker labels, can just remain the same.

Motivation, pitch

Usually we would like to organize a corpus into some standardized data format (like Kaldi's data directory) before loading the actual waveforms, so we have more flexibility over how many training data points to use, and over preprocessing steps that might be hard to do in __getitem__ and collate_fn. For example, in SUPERB ASV, if an utterance in VoxCeleb1 is not longer than 2 seconds after VAD, it is discarded and not used for training (no utterance in the testing set falls into this case). With the current dataset class implementation, I have not yet come up with a feasible way to realize this kind of logic, since __len__ is fixed.

Furthermore, by switching to a waveform-path mode, it becomes really fast to iterate over the entire dataset. Then I can re-use the dataset classes as corpus-directory parsing helpers: easily obtain all the transcriptions in LibriSpeech to train tokenizers, or easily obtain all the speakers in VoxCeleb1 to build a categorical encoder. The current implementation, which returns the actual waveforms, can be too slow to iterate over the entire dataset, which is hard to tolerate for mere preprocessing steps.
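
To make the filtering concrete, here is a minimal sketch of what I have in mind. The return_path flag is hypothetical (it does not exist in torchaudio today); torchaudio.info is the existing API for reading a file header without decoding the audio:

```python
import torch
import torchaudio
from torchaudio.datasets import LIBRISPEECH

# Hypothetical `return_path=True` flag: the dataset would yield the audio
# path instead of the decoded waveform. This flag does not exist today.
dataset = LIBRISPEECH("./data", url="train-clean-100", download=True, return_path=True)

# Drop utterances of 2 seconds or shorter without decoding any audio;
# torchaudio.info reads only the file header.
kept = []
for i in range(len(dataset)):
    path, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset[i]
    info = torchaudio.info(path)
    if info.num_frames / info.sample_rate > 2.0:
        kept.append(i)

# A Subset gives a dataset whose __len__ reflects only the kept utterances.
subset = torch.utils.data.Subset(dataset, kept)
```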

Alternatives

No

Additional context

No response

leo19941227 avatar Jul 11 '22 15:07 leo19941227

I think the proposal makes sense. Decoupling dataset traversal from audio decoding makes it flexible.

For the second part, I think the most flexible approach is to have an explicit table representation of the dataset internally, and to apply operations like "drop rows that do not meet this criterion" and "split the dataset into batches with similar length" on it. I get the feeling that this would also give user code an opportunity to improve the overall throughput.

The existing TorchAudio datasets started as simple adaptations of the classic torch.utils.data.Dataset, which was not designed for advanced usage like this. I think it's a good time for us to move away from it.
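
A rough sketch of the table idea (all names here are illustrative, not an actual torchaudio API):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Row:
    path: str
    transcript: str
    speaker_id: int
    num_frames: int

class TableDataset:
    """Traversal and filtering operate on metadata rows only;
    audio decoding is deferred until an item is materialized."""

    def __init__(self, rows: List[Row]):
        self.rows = rows

    def filter(self, keep: Callable[[Row], bool]) -> "TableDataset":
        # "Drop rows that do not meet this criterion."
        return TableDataset([r for r in self.rows if keep(r)])

    def bucket_by_length(self, batch_size: int) -> List[List[Row]]:
        # "Split the dataset into batches with similar length":
        # sorting by length keeps padding waste low inside each batch.
        ordered = sorted(self.rows, key=lambda r: r.num_frames)
        return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```

Because each batch groups rows of similar length, collate pads less, which is where the throughput gain would come from.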

mthrok avatar Jul 11 '22 16:07 mthrok

Cool!

leo19941227 avatar Jul 11 '22 17:07 leo19941227

Hi @leo19941227, we can add the metadata feature under prototype. One potential solution is adding a fetch_metadata method to each dataset. The returned item is mostly identical; only the waveform is replaced by the absolute path of the audio file. Does that suit your need?

nateanl avatar Jul 25 '22 21:07 nateanl

Hello @nateanl ,

Cool! I think this is nice.

leo19941227 avatar Jul 25 '22 23:07 leo19941227

Hi @leo19941227, we've added metadata mode support for LibriSpeech in #2653, let us know if this looks good for your use case and we'll work to add support for the other datasets in the upcoming weeks as well (to be released as part of 0.13).
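
In rough terms, the new mode is a per-item metadata accessor that mirrors __getitem__ but returns the audio file path in place of the decoded waveform (see #2653 for the exact name and signature; the field order below is assumed to match __getitem__). That makes a full pass over the dataset cheap, e.g. to collect all transcripts for tokenizer training:

```python
from torchaudio.datasets import LIBRISPEECH

dataset = LIBRISPEECH("./data", url="train-clean-100", download=True)

# get_metadata(i) mirrors __getitem__(i), with the decoded waveform
# replaced by the audio file path, so no audio is read or decoded.
transcripts = []
for i in range(len(dataset)):
    filepath, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset.get_metadata(i)
    transcripts.append(transcript)
```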

carolineechen avatar Sep 12 '22 18:09 carolineechen

Hi @carolineechen,

Thanks, it looks good to me!

leo19941227 avatar Sep 12 '22 18:09 leo19941227

If a new Dataset abstraction is implemented in torchaudio, please consider the known problems of storing data/strings as native Python objects for giant datasets: https://github.com/speechbrain/speechbrain/issues/872

https://github.com/pytorch/pytorch/issues/13246
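
The gist of those issues: a plain Python list of strings dirties copy-on-write pages in every forked DataLoader worker, because merely reading an element touches its refcount. One common mitigation is packing the strings into a single numpy array (a sketch, not a torchaudio API):

```python
import numpy as np

paths = ["corpus/spk1/utt1.flac", "corpus/spk1/utt2.flac"]  # gathered once at construction

# A fixed-width bytes array (dtype "S<maxlen>") is one contiguous,
# refcount-free buffer, so forked workers can read it without
# copying the underlying memory pages.
encoded = np.array([p.encode("utf-8") for p in paths])

def get_path(i: int) -> str:
    return encoded[i].decode("utf-8")

assert get_path(1) == "corpus/spk1/utt2.flac"
```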

vadimkantorov avatar Oct 19 '22 20:10 vadimkantorov