Is the XL subset the 33,000-hour unlabeled data?

Open: mct10 opened this issue 2 years ago • 1 comment

Hi, as mentioned in the README, GigaSpeech contains "33,000+ hours for unsupervised/semi-supervised learning". I am trying to use this unlabeled data, and I have already downloaded the XL subset. But when I summed the duration of every audio entry in GigaSpeech.json, the total was only around 25,000 hours. So my question is: is the entire XL subset the 33,000-hour data, or are additional steps needed to retrieve the 33,000-hour data? Many thanks!

mct10, May 23 '22 12:05

  • There are 33,000+ hours of audio files in total under the GigaSpeech directory.
  • GigaSpeech.json contains 10,000 hours of audio segments with transcriptions for supervised training.

dophist, Jun 07 '22 13:06
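
For anyone repeating the duration check from the question, here is a minimal sketch that sums audio-level and segment-level durations separately, since the two totals measure different things (whole recordings versus transcribed segments). The key names `audios`, `duration`, `segments`, `begin_time`, and `end_time` are assumptions about the GigaSpeech.json layout, and `total_hours` is a hypothetical helper, so check both against your copy of the file.

```python
def total_hours(meta):
    """Return (audio_hours, segment_hours) for a GigaSpeech-style metadata dict.

    Assumed layout (verify against your file):
    meta["audios"] is a list of entries, each with a "duration" in seconds
    and a "segments" list whose items carry "begin_time" and "end_time".
    """
    audio_sec = 0.0
    segment_sec = 0.0
    for audio in meta.get("audios", []):
        # Full length of the recording, transcribed or not.
        audio_sec += float(audio.get("duration", 0.0))
        # Only the portions covered by segments.
        for seg in audio.get("segments", []):
            segment_sec += float(seg["end_time"]) - float(seg["begin_time"])
    return audio_sec / 3600.0, segment_sec / 3600.0


# Tiny made-up example, not real GigaSpeech data.
example = {
    "audios": [
        {
            "duration": 7200.0,
            "segments": [
                {"begin_time": 0.0, "end_time": 1800.0},
                {"begin_time": 3600.0, "end_time": 5400.0},
            ],
        },
        {"duration": 3600.0, "segments": []},
    ]
}

audio_hours, segment_hours = total_hours(example)
print(f"audio: {audio_hours:.1f} h, segments: {segment_hours:.1f} h")
```

To run it on the real metadata, load the file with `json.load` and pass the resulting dict to `total_hours`. The file is large, so loading it in one go may need a lot of memory.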